Fast Fourier Transforms (6x9 
Version) 



Collection Editor: 

C. Sidney Burrus 




Authors: 

C. Sidney Burrus 

Matteo Frigo 

Steven G. Johnson 

Markus Pueschel 

Ivan Selesnick 

Online:

<http://cnx.org/content/col10683/1.5/>

CONNEXIONS 
Rice University, Houston, Texas 



This selection and arrangement of content as a collection is copyrighted
by C. Sidney Burrus. It is licensed under the Creative Commons Attribution
3.0 license (http://creativecommons.org/licenses/by/3.0/).
Collection structure revised: August 24, 2009 
PDF generated: May 7, 2012 

For copyright and attribution information for the modules contained in 
this collection, see p. 351. 



Table of Contents 



1 Preface: Fast Fourier Transforms
2 Introduction: Fast Fourier Transforms
3 Multidimensional Index Mapping
4 Polynomial Description of Signals
5 The DFT as Convolution or Filtering
6 Factoring the Signal Processing Operators
7 Winograd's Short DFT Algorithms
8 DFT and FFT: An Algebraic View
9 The Cooley-Tukey Fast Fourier Transform Algorithm
10 The Prime Factor and Winograd Fourier Transform Algorithms
11 Implementing FFTs in Practice
12 Algorithms for Data with Restrictions
13 Convolution Algorithms
14 Comments: Fast Fourier Transforms
15 Conclusions: Fast Fourier Transforms
16 Appendix 1: FFT Flowgraphs
17 Appendix 2: Operation Counts for General Length FFT
18 Appendix 3: FFT Computer Programs
19 Appendix 4: Programs for Short FFTs
Bibliography
Index
Attributions






Chapter 1 

Preface: Fast Fourier Transforms



This book focuses on the discrete Fourier transform (DFT), dis- 
crete convolution, and, particularly, the fast algorithms to calculate 
them. These topics have been at the center of digital signal pro- 
cessing since its beginning, and new results in hardware, theory 
and applications continue to keep them important and exciting. 

As far as we can tell, Gauss was the first to propose the techniques 
that we now call the fast Fourier transform (FFT) for calculating 
the coefficients in a trigonometric expansion of an asteroid's orbit 
in 1805 [174]. However, it was the seminal paper by Cooley and 
Tukey [88] in 1965 that caught the attention of the science and 
engineering community and, in a way, founded the discipline of 
digital signal processing (DSP). 

The impact of the Cooley-Tukey FFT was enormous. Problems
could be solved quickly that were not even considered a few years
earlier. A flurry of research expanded the theory and developed
excellent practical programs as well as opening new applications
[94]. In 1976, Winograd published a short paper [403] that set a
second flurry of research in motion [86]. This was another type
of algorithm that expanded the data lengths that could be
transformed efficiently and reduced the number of multiplications
required. The groundwork for this algorithm had been set earlier by
Good [148] and by Rader [308]. In 1997 Frigo and Johnson
developed a program they called the FFTW (fastest Fourier transform
in the west) [130], [135], which is a composite of many of the ideas
in other algorithms as well as new results to give a robust, very
fast system for general data lengths on a variety of computer and
DSP architectures. This work won the 1999 Wilkinson Prize for
Numerical Software.

It is hard to overemphasize the importance of the DFT, convolution,
and fast algorithms. With a history that goes back to Gauss
[174] and a compilation of references on these topics that in 1995 
resulted in over 2400 entries [362], the FFT may be the most im- 
portant numerical algorithm in science, engineering, and applied 
mathematics. New theoretical results still are appearing, advances 
in computers and hardware continually restate the basic questions, 
and new applications open new areas for research. It is hoped that 
this book will provide the background, references, programs and 
incentive to encourage further research and results in this area as 
well as provide tools for practical applications. 

Studying the FFT is not only valuable in understanding a powerful 
tool, it is also a prototype or example of how algorithms can be 
made efficient and how a theory can be developed to define opti- 
mality. The history of this development also gives insight into the 
process of research where timing and serendipity play interesting 
roles. 

Much of the material contained in this book has been collected
over 40 years of teaching and research in DSP; therefore, it is
difficult to attribute just where it all came from. Some comes from
my earlier FFT book [59] and some from the FFT chapter in [217].



Certainly the interaction with people like Jim Cooley and Char- 
lie Rader was central but the work with graduate students and 
undergraduates was probably the most formative. I would par- 
ticularly like to acknowledge Ramesh Agarwal, Howard Johnson, 
Mike Heideman, Henrik Sorensen, Doug Jones, Ivan Selesnick, 
Haitao Guo, and Gary Sitton. Interaction with my colleagues, Tom 
Parks, Hans Schuessler, Al Oppenheim, and Sanjit Mitra has been 
essential over many years. Support has come from the NSF, Texas 
Instruments, and the wonderful teaching and research environment 
at Rice University and in the IEEE Signal Processing Society. 

Several chapters or sections are written by authors who have
extensive experience and depth working on the particular topics. Ivan
Selesnick has written several papers on the design of short FFTs to
be used in the prime factor algorithm (PFA) FFT and on automatic
design of these short FFTs. Markus Pueschel has developed a
theoretical framework for "Algebraic Signal Processing" which allows
a structured generation of FFT programs and a system called
"Spiral" for automatically generating algorithms specifically for an
architecture. Steven Johnson along with his colleague Matteo Frigo
created, developed, and now maintains the powerful FFTW system:
the Fastest Fourier Transform in the West. I sincerely thank
these authors for their significant contributions.

I would also like to thank Prentice Hall, Inc., who returned the
copyright on The DFT as Convolution or Filtering (Chapter 5) of
Advanced Topics in Signal Processing [49] around which some
of this book is built. The content of this book is in the Connexions
(http://cnx.org/content/col10550/) repository and, therefore,
is available for on-line use, PDF downloading, or purchase as a
printed, bound physical book. I certainly want to thank Daniel
Williamson, Amy Kavalewitz, and the staff of Connexions for their
invaluable help. Additional FFT material can be found in Connexions,
particularly content by Doug Jones [205], Ivan Selesnick
[205], and Howard Johnson [205]. Note that this book and all the
content in Connexions are copyrighted under the Creative Commons
Attribution license (http://creativecommons.org/).

If readers find errors in any of the modules of this collection or 
have suggestions for improvements or additions, please email the 
author of the collection or module. 



C. Sidney Burrus 
Houston, Texas 
October 20, 2008 



Chapter 2 

Introduction: Fast Fourier Transforms



The development of fast algorithms usually consists of using spe- 
cial properties of the algorithm of interest to remove redundant or 
unnecessary operations of a direct implementation. Because of the 
periodicity, symmetries, and orthogonality of the basis functions 
and the special relationship with convolution, the discrete Fourier 
transform (DFT) has enormous capacity for improvement of its 
arithmetic efficiency. 

There are four main approaches to formulating efficient DFT
algorithms [50]. The first two break a DFT into multiple shorter ones.
This is done in Multidimensional Index Mapping (Chapter 3) by
using an index map and in Polynomial Description of Signals
(Chapter 4) by polynomial reduction. The third is Factoring the
Signal Processing Operators (Chapter 6), which factors the DFT
operator (matrix) into sparse factors. The fourth, in The DFT as
Convolution or Filtering (Chapter 5), develops a method which converts
a prime-length DFT into cyclic convolution. Still another interesting
approach applies in certain cases, where the evaluation of the DFT can
be posed recursively as evaluating a DFT in terms of two half-length
DFTs, which are each in turn evaluated in terms of quarter-length DFTs,
and so on.

The very important computational complexity theorems of Wino- 
grad are stated and briefly discussed in Winograd's Short DFT 
Algorithms (Chapter 7). The specific details and evaluations of 
the Cooley-Tukey FFT and Split-Radix FFT are given in The 
Cooley-Tukey Fast Fourier Transform Algorithm (Chapter 9), and 
PFA and WFTA are covered in The Prime Factor and Winograd 
Fourier Transform Algorithms (Chapter 10). A short discussion of 
high speed convolution is given in Convolution Algorithms (Chap- 
ter 13), both for its own importance, and its theoretical connection 
to the DFT. We also present the chirp, Goertzel, QFT, NTT, SR- 
FFT, Approx FFT, Autogen, and programs to implement some of 
these. 

Ivan Selesnick gives a short introduction in Winograd's Short DFT
Algorithms (Chapter 7) to using Winograd's techniques to give
a highly structured development of short prime length FFTs and
describes a program that will automatically write these programs.
Markus Pueschel presents his "Algebraic Signal Processing" in
DFT and FFT: An Algebraic View (Chapter 8) on describing the
various FFT algorithms. And Steven Johnson describes the FFTW
(Fastest Fourier Transform in the West) in Implementing FFTs in
Practice (Chapter 11).

The organization of the book represents the various approaches to
understanding the FFT and to obtaining efficient computer programs.
It also shows the intimate relationship between theory and
implementation that can be used to real advantage. The disparity
in material devoted to the various approaches represents the tastes
of this author, not any intrinsic differences in value.

A fairly long list of references is given but it is impossible to be
truly complete. I have referenced the work that I have used and
that I am aware of. The collection of computer programs is also
somewhat idiosyncratic. They are in Matlab and Fortran because
that is what I have used over the years. They also are written
primarily for their educational value although some are quite efficient.
There is excellent content in the Connexions book by Doug Jones
[206].



Chapter 3

Multidimensional Index Mapping



A powerful approach to the development of efficient algorithms is 
to break a large problem into multiple small ones. One method for 
doing this with both the DFT and convolution uses a linear change 
of index variables to map the original one-dimensional problem 
into a multi-dimensional problem. This approach provides a uni- 
fied derivation of the Cooley-Tukey FFT, the prime factor algo- 
rithm (PFA) FFT, and the Winograd Fourier transform algorithm 
(WFTA) FFT. It can also be applied directly to convolution to break 
it down into multiple short convolutions that can be executed faster 
than a direct implementation. It is often easy to translate an algo- 
rithm using index mapping into an efficient program. 

The basic definition of the discrete Fourier transform (DFT) is

$$C(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk} \tag{3.1}$$

where n, k, and N are integers, $j = \sqrt{-1}$, the basis functions are
the N roots of unity,

$$W_N = e^{-j2\pi/N} \tag{3.2}$$

and $k = 0, 1, 2, \ldots, N-1$.

If the N values of the transform are calculated from the N values
of the data, x(n), it is easily seen that $N^2$ complex multiplications
and approximately that same number of complex additions
are required. One method for reducing this required arithmetic is
to use an index mapping (a change of variables) to change the
one-dimensional DFT into a two- or higher dimensional DFT. This is
one of the ideas behind the very efficient Cooley-Tukey [89] and
Winograd [404] algorithms. The purpose of index mapping is to
change a large problem into several easier ones [46], [120]. This
is sometimes called the "divide and conquer" approach [26] but a
more accurate description would be "organize and share" which
explains the process of redundancy removal or reduction.
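
As a point of reference for the operation counts discussed in this chapter, a direct evaluation of (3.1) can be sketched as follows. This is only an illustrative sketch in Python (the book's own programs are in Matlab and Fortran); the function name `direct_dft` and the multiplication counter are assumptions introduced here, and the counter includes trivial multiplications by one.

```python
import cmath

def direct_dft(x):
    """Evaluate C(k) = sum_n x(n) W_N^(nk) of (3.1) directly.

    Returns the DFT values and the number of complex multiplications
    performed (every product is counted, including those by 1).
    """
    N = len(x)
    C = []
    mults = 0
    for k in range(N):
        acc = 0j
        for n in range(N):
            # W_N^(nk) with the exponent reduced modulo N, see (3.2)
            w = cmath.exp(-2j * cmath.pi * ((n * k) % N) / N)
            acc += x[n] * w
            mults += 1
        C.append(acc)
    return C, mults

# A unit impulse of length 8: the count is N^2 regardless of the data.
C, mults = direct_dft([1, 0, 0, 0, 0, 0, 0, 0])
print(mults)
```

The double loop makes the $N^2$ growth explicit; the index maps in the following sections reorganize exactly these products so that many of them are shared.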

3.1 The Index Map 

For a length-N sequence, the time index takes on the values

$$n = 0, 1, 2, \ldots, N-1 \tag{3.3}$$

When the length of the DFT is not prime, N can be factored as
$N = N_1 N_2$ and two new independent variables can be defined over
the ranges

$$n_1 = 0, 1, 2, \ldots, N_1-1 \tag{3.4}$$

$$n_2 = 0, 1, 2, \ldots, N_2-1 \tag{3.5}$$

A linear change of variables is defined which maps $n_1$ and $n_2$ to n
and is expressed by

$$n = ((K_1 n_1 + K_2 n_2))_N \tag{3.6}$$

where $K_i$ are integers and the notation $((x))_N$ denotes the integer
residue of x modulo N [232]. This map defines a relation between
all possible combinations of $n_1$ and $n_2$ in (3.4) and (3.5) and the
values for n in (3.3). The question as to whether all of the n in
(3.3) are represented, i.e., whether the map is one-to-one (unique),
has been answered in [46] showing that certain integer $K_i$ always
exist such that the map in (3.6) is one-to-one. Two cases must be
considered.

3.1.1 Case 1. 

$N_1$ and $N_2$ are relatively prime, i.e., the greatest common divisor
$(N_1, N_2) = 1$.

The integer map of (3.6) is one-to-one if and only if:

$$(K_1 = aN_2) \text{ and/or } (K_2 = bN_1) \text{ and } (K_1, N_1) = (K_2, N_2) = 1 \tag{3.7}$$

where a and b are integers.

3.1.2 Case 2. 

$N_1$ and $N_2$ are not relatively prime, i.e., $(N_1, N_2) > 1$.
The integer map of (3.6) is one-to-one if and only if:

$$(K_1 = aN_2) \text{ and } (K_2 \neq bN_1) \text{ and } (a, N_1) = (K_2, N_2) = 1 \tag{3.8}$$

or

$$(K_1 \neq aN_2) \text{ and } (K_2 = bN_1) \text{ and } (K_1, N_1) = (b, N_2) = 1 \tag{3.9}$$

Reference [46] should be consulted for the details of these conditions
and examples. Two classes of index maps are defined from
these conditions.



3.1.3 Type-One Index Map: 

The map of (3.6) is called a type-one map when integers a and b
exist such that

$$K_1 = aN_2 \text{ and } K_2 = bN_1 \tag{3.10}$$



3.1.4 Type-Two Index Map: 

The map of (3.6) is called a type-two map when integers a
and b exist such that

$$K_1 = aN_2 \text{ or } K_2 = bN_1, \text{ but not both.} \tag{3.11}$$

The type-one can be used only if the factors of N are relatively
prime, but the type-two can be used whether they are relatively
prime or not. Good [149], Thomas, and Winograd [404] all used
the type-one map in their DFT algorithms. Cooley and Tukey
[89] used the type-two in their algorithms, both for a fixed radix
($N = R^M$) and a mixed radix [301].
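
The uniqueness conditions can be checked numerically for small lengths by simply enumerating the map of (3.6). The following is a minimal brute-force sketch, not the number-theoretic test of [46]; the function name `is_one_to_one` and the particular choices of $K_1$ and $K_2$ are illustrative assumptions.

```python
def is_one_to_one(K1, K2, N1, N2):
    """Return True if n = (K1*n1 + K2*n2) mod N reaches every
    n in 0..N-1 exactly once, where N = N1*N2."""
    N = N1 * N2
    images = {(K1 * n1 + K2 * n2) % N
              for n1 in range(N1) for n2 in range(N2)}
    return len(images) == N

# Type-two maps (only one of K1 = a*N2, K2 = b*N1 holds)
print(is_one_to_one(4, 1, 4, 4))   # N = 16, factors share a common factor
print(is_one_to_one(5, 1, 3, 5))   # N = 15, factors relatively prime

# Type-one map (both K1 = a*N2 and K2 = b*N1 hold)
print(is_one_to_one(5, 3, 3, 5))   # N = 15, factors relatively prime
print(is_one_to_one(4, 4, 4, 4))   # N = 16, type-one attempted anyway

# Neither K1 nor K2 has the required form
print(is_one_to_one(2, 1, 4, 4))
```

The first three maps cover all N indices, while the last two collapse several index pairs onto the same n, which agrees with the statement that the type-one map is available only when the factors are relatively prime.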

The frequency index is defined by a map similar to (3.6) as

$$k = ((K_3 k_1 + K_4 k_2))_N \tag{3.12}$$

where the same conditions, (3.7) and (3.8), are used for determining
the uniqueness of this map in terms of the integers $K_3$ and $K_4$.

Two-dimensional arrays for the input data and its DFT are defined
using these index maps to give

$$\hat{x}(n_1, n_2) = x((K_1 n_1 + K_2 n_2))_N \tag{3.13}$$

$$\hat{X}(k_1, k_2) = X((K_3 k_1 + K_4 k_2))_N \tag{3.14}$$

In some of the following equations, the residue reduction notation
will be omitted for clarity. These changes of variables applied to
the definition of the DFT given in (3.1) give

$$C(k) = \sum_{n_2=0}^{N_2-1} \sum_{n_1=0}^{N_1-1} x(n)\, W_N^{K_1K_3 n_1k_1}\, W_N^{K_1K_4 n_1k_2}\, W_N^{K_2K_3 n_2k_1}\, W_N^{K_2K_4 n_2k_2} \tag{3.15}$$

where all of the exponents are evaluated modulo N.

The amount of arithmetic required to calculate (3.15) is the same as
in the direct calculation of (3.1). However, because of the special
nature of the DFT, the integer constants $K_i$ can be chosen in such
a way that the calculations are "uncoupled" and the arithmetic is
reduced. The requirements for this are

$$((K_1 K_4))_N = 0 \text{ and/or } ((K_2 K_3))_N = 0 \tag{3.16}$$

When this condition and those for uniqueness in (3.6) are applied,
it is found that the $K_i$ may always be chosen such that one of the
terms in (3.16) is zero. If the $N_i$ are relatively prime, it is always
possible to make both terms zero. If the $N_i$ are not relatively prime,
only one of the terms can be set to zero. When they are relatively
prime, there is a choice: it is possible to set either one or both to
zero. This in turn causes one or both of the center two W terms in
(3.15) to become unity.

An example of the Cooley-Tukey radix-4 FFT for a length-16 DFT
uses the type-two map with $K_1 = 4$, $K_2 = 1$, $K_3 = 1$, $K_4 = 4$ giving

$$n = 4n_1 + n_2 \tag{3.17}$$

$$k = k_1 + 4k_2 \tag{3.18}$$

The residue reduction in (3.6) is not needed here since n does not
exceed N as $n_1$ and $n_2$ take on their values. Since, in this example,
the factors of N have a common factor, only one of the conditions
in (3.16) can hold and, therefore, (3.15) becomes

$$C(k_1, k_2) = C(k) = \sum_{n_2=0}^{3} \sum_{n_1=0}^{3} x(n)\, W_4^{n_1k_1}\, W_{16}^{n_2k_1}\, W_4^{n_2k_2} \tag{3.19}$$

Note the definition of $W_N$ in (3.2) allows the simple form of
$W_{16}^{K_1K_3} = W_4$.

This has the form of a two-dimensional DFT with an extra term
$W_{16}$, called a "twiddle factor". The inner sum over $n_1$ represents
four length-4 DFTs, the $W_{16}$ term represents 16 complex multiplications,
and the outer sum over $n_2$ represents another four length-4
DFTs. This choice of the $K_i$ "uncouples" the calculations since the
first sum over $n_1$ for $n_2 = 0$ calculates the DFT of the first row of
the data array $\hat{x}(n_1, n_2)$, and those data values are never needed
in the succeeding row calculations. The row calculations are independent,
and examination of the outer sum shows that the column
calculations are likewise independent. This is illustrated in Figure 3.1.



Figure 3.1: Uncoupling of the Row and Column Calculations
(Rectangles are Data Arrays)



The left 4-by-4 array is the mapped input data, the center array has 
the rows transformed, and the right array is the DFT array. The 
row DFTs and the column DFTs are independent of each other. 
The twiddle factors (TF), which are the center W in (3.19), are the
multiplications which take place on the center array of Figure 3.1.

This uncoupling feature reduces the amount of arithmetic required 
and allows the results of each row DFT to be written back over 
the input data locations, since that input row will not be needed 
again. This is called "in-place" calculation and it results in a large 
memory requirement savings. 
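
The three stages of the length-16 example (row DFTs, twiddle factor multiplications, column DFTs) can be sketched in Python and compared with a direct evaluation of (3.1). This is a sketch under stated assumptions: the helper names `w`, `direct_dft`, and `dft16_type_two` are introduced here for illustration, the short length-4 DFTs are themselves computed directly rather than with a multiplication-free butterfly, and a separate working array is used instead of in-place storage.

```python
import cmath

def w(N, e):
    """W_N^e = exp(-j*2*pi*e/N), with the exponent reduced modulo N."""
    return cmath.exp(-2j * cmath.pi * (e % N) / N)

def direct_dft(x):
    """Direct evaluation of (3.1), used only as a reference."""
    N = len(x)
    return [sum(x[n] * w(N, n * k) for n in range(N)) for k in range(N)]

def dft16_type_two(x):
    """Length-16 DFT using n = 4*n1 + n2 and k = k1 + 4*k2, as in (3.19)."""
    N1 = N2 = 4
    # Inner sum: one length-4 DFT over n1 for each n2; y[k1][n2]
    y = [[sum(x[4 * n1 + n2] * w(4, n1 * k1) for n1 in range(N1))
          for n2 in range(N2)] for k1 in range(N1)]
    # Twiddle factors W_16^(n2*k1) applied to the intermediate array
    for k1 in range(N1):
        for n2 in range(N2):
            y[k1][n2] *= w(16, n2 * k1)
    # Outer sum: one length-4 DFT over n2 for each k1
    C = [0j] * 16
    for k1 in range(N1):
        for k2 in range(N2):
            C[k1 + 4 * k2] = sum(y[k1][n2] * w(4, n2 * k2)
                                 for n2 in range(N2))
    return C

x = [complex(n, (3 * n) % 5) for n in range(16)]
err = max(abs(a - b) for a, b in zip(dft16_type_two(x), direct_dft(x)))
print(err < 1e-9)
```

Each inner DFT reads only one value of $n_2$ and each outer DFT reads only one value of $k_1$, which is the uncoupling that permits the results to overwrite the data in an in-place program.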

An example of the type-two map used when the factors of N are
relatively prime is given for N = 15 as

$$n = 5n_1 + n_2 \tag{3.20}$$

$$k = k_1 + 3k_2 \tag{3.21}$$

The residue reduction is again not explicitly needed. Although the
factors 3 and 5 are relatively prime, use of the type-two map sets
only one of the terms in (3.16) to zero. The DFT in (3.15) becomes

$$\hat{X} = \sum_{n_2=0}^{4} \sum_{n_1=0}^{2} \hat{x}\, W_3^{n_1k_1}\, W_{15}^{n_2k_1}\, W_5^{n_2k_2} \tag{3.22}$$

which has the same form as (3.19), including the existence of the
twiddle factors (TF). Here the inner sum is five length-3 DFTs, one
for each value of $n_2$. This is illustrated in Figure 3.2 where the
rectangles are the 5 by 3 data arrays and the system is called a
"mixed radix" FFT.







[Figure 3.2 panels: x(n_1, n_2), x(k_1, n_2), X(k_1, k_2)]

Figure 3.2: Uncoupling of the Row and Column Calculations
(Rectangles are Data Arrays)



An alternate illustration is shown in Figure 3.3 where the rectangles
are the short length 3 and 5 DFTs.

Figure 3.3: Uncoupling of the Row and Column Calculations
(Rectangles are Short DFTs)



The type-one map is illustrated next on the same length-15 example.
This time the condition of (3.7) with the "and" option is
used, as in (3.10), giving an index map of

$$n = 5n_1 + 3n_2 \tag{3.23}$$

and

$$k = 10k_1 + 6k_2 \tag{3.24}$$

The residue reduction is now necessary. Since the factors of N are
relatively prime and the type-one map is being used, both terms in
(3.16) are zero, and (3.15) becomes

$$\hat{X} = \sum_{n_2=0}^{4} \sum_{n_1=0}^{2} \hat{x}\, W_3^{n_1k_1}\, W_5^{n_2k_2} \tag{3.25}$$

which is similar to (3.22), except that now the type-one map gives
a pure two-dimensional DFT calculation with no TFs, and the sums
can be done in either order. Figure 3.2 and Figure 3.3 also
describe this case but now there are no twiddle factor multiplications
in the center and the resulting system is called a "prime factor
algorithm" (PFA).
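
The absence of twiddle factors in (3.25) can be confirmed numerically. The sketch below, in Python, applies the maps (3.23) and (3.24) with the residue reduction modulo 15 and compares the result with a direct DFT. The names `w`, `direct_dft`, and `dft15_type_one` are assumptions made for this illustration, and the short length-3 and length-5 DFTs are computed directly rather than by the efficient short modules used in a practical PFA.

```python
import cmath

def w(N, e):
    """W_N^e = exp(-j*2*pi*e/N), with the exponent reduced modulo N."""
    return cmath.exp(-2j * cmath.pi * (e % N) / N)

def direct_dft(x):
    """Direct evaluation of (3.1), used only as a reference."""
    N = len(x)
    return [sum(x[n] * w(N, n * k) for n in range(N)) for k in range(N)]

def dft15_type_one(x):
    """Length-15 DFT with n = (5*n1 + 3*n2) mod 15 and
    k = (10*k1 + 6*k2) mod 15, following (3.23) to (3.25)."""
    N1, N2 = 3, 5
    # Length-3 DFTs over n1, one for each n2; y[k1][n2]
    y = [[sum(x[(5 * n1 + 3 * n2) % 15] * w(3, n1 * k1)
              for n1 in range(N1))
          for n2 in range(N2)] for k1 in range(N1)]
    # Length-5 DFTs over n2, one for each k1; no twiddle factors between
    C = [0j] * 15
    for k1 in range(N1):
        for k2 in range(N2):
            C[(10 * k1 + 6 * k2) % 15] = sum(y[k1][n2] * w(5, n2 * k2)
                                             for n2 in range(N2))
    return C

x = [complex(n * n % 7, n) for n in range(15)]
err = max(abs(a - b) for a, b in zip(dft15_type_one(x), direct_dft(x)))
print(err < 1e-9)
```

Because the product of the two maps leaves only the terms $50 n_1 k_1$ and $18 n_2 k_2$ nonzero modulo 15, the exponent reduces to $W_3^{n_1k_1} W_5^{n_2k_2}$, which is why the two stages can be applied in either order.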

The purpose of index mapping is to improve the arithmetic efficiency.
For example, a direct calculation of a length-16 DFT requires
$16^2$ or 256 complex multiplications (recall that one complex
multiplication requires four real multiplications and two real
additions) and an uncoupled version requires 144. A direct calculation
of a length-15 DFT requires 225 multiplications, but with a type-two
map only 135 and with a type-one map, 120.

Algorithms of practical interest use short DFT's that require fewer
than $N^2$ multiplications. For example, length-4 DFTs require no
multiplications and, therefore, for the length-16 DFT, only the TFs
must be calculated. That calculation uses 16 multiplications, many
fewer than the 256 or 144 required for the direct or uncoupled
calculation.
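
The counts quoted above follow from a simple tally: $N_2$ short DFTs of length $N_1$, $N_1$ short DFTs of length $N_2$, and N twiddle factor multiplications when a type-two map is used. A small Python sketch of this bookkeeping is given below; the function names are illustrative assumptions, and each length-L short DFT is assumed to be computed directly with $L^2$ multiplications.

```python
def direct_count(N):
    """Complex multiplications for a direct length-N DFT."""
    return N * N

def mapped_count(N1, N2, twiddles):
    """Complex multiplications for an uncoupled N1-by-N2 DFT when each
    short DFT of length L is done directly with L*L multiplications."""
    count = N2 * N1 ** 2 + N1 * N2 ** 2
    if twiddles:               # a type-two map adds one TF per data point
        count += N1 * N2
    return count

print(direct_count(16), mapped_count(4, 4, twiddles=True))
print(direct_count(15),
      mapped_count(3, 5, twiddles=True),
      mapped_count(3, 5, twiddles=False))
```

The two printed lines reproduce the figures 256 and 144 for length 16, and 225, 135, and 120 for length 15.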

The concept of using an index map can also be applied to convolution
to convert a length $N = N_1 N_2$ one-dimensional cyclic convolution
into an $N_1$ by $N_2$ two-dimensional cyclic convolution [46], [6].
There is no savings of arithmetic from the mapping alone as there
is with the DFT, but savings can be obtained by using special short
algorithms along each dimension. This is discussed in Algorithms
for Data with Restrictions (Chapter 12).



3.2 In-Place Calculation of the DFT and 
Scrambling 

Because use of both the type-one and two index maps uncouples
the calculations of the rows and columns of the data array, the
results of each short length-$N_i$ DFT can be written back over the data
as it will not be needed again after that particular row or column
is transformed. This is easily seen from Figure 3.1, Figure 3.2,
and Figure 3.3 where the DFT of the first row of $\hat{x}(n_1, n_2)$
can be put back over the data rather than written into a new array. After
all the calculations are finished, the total DFT is in the array of the
original data. This gives a significant memory savings over using
a separate array for the output.

Unfortunately, the use of in-place calculations results in the order 
of the DFT values being permuted or scrambled. This is because 
the data is indexed according to the input map (3.6) and the results 
are put into the same locations rather than the locations dictated by 
the output map (3.12). For example with a length-8 radix-2 FFT, 
the input index map is 

$$n = 4n_1 + 2n_2 + n_3 \tag{3.26}$$

which to satisfy (3.16) requires an output map of

$$k = k_1 + 2k_2 + 4k_3 \tag{3.27}$$

The in-place calculations will place the DFT results in the loca- 
tions of the input map and these should be reordered or unscram- 
bled into the locations given by the output map. Examination of 
these two maps shows the scrambled output to be in a "bit reversed" 
order. 
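
The relation between the two maps can be tabulated directly. In the Python sketch below, each binary digit $k_i$ is assumed to end up in the address position that held $n_i$, which is what an in-place calculation does; the function name `bit_reverse` is an illustrative assumption.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` binary digits of `index`."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (index & 1)
        index >>= 1
    return r

# Walk through all digit combinations of the length-8 radix-2 example.
stored_at = {}
for d1 in (0, 1):
    for d2 in (0, 1):
        for d3 in (0, 1):
            location = 4 * d1 + 2 * d2 + d3   # address from the input map (3.26)
            k = d1 + 2 * d2 + 4 * d3          # DFT index from the output map (3.27)
            stored_at[location] = k

order = [stored_at[a] for a in range(8)]
print(order)
print(order == [bit_reverse(a, 3) for a in range(8)])
```

The list printed first gives, for each memory address, the index k of the DFT value found there after an in-place calculation; it coincides with the bit-reversed address.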

For certain applications, this scrambled output order is not impor- 
tant, but for many applications, the order must be unscrambled be- 
fore the DFT can be considered complete. Because the radix of 
the radix-2 FFT is the same as the base of the binary number rep- 
resentation, the correct address for any term is found by reversing 
the binary bits of the address. The part of most FFT programs that 
does this reordering is called a bit-reversed counter. Examples of 
various unscramblers are found in [146], [60] and in the appendices.

The development here uses the input map and the resulting algorithm
is called "decimation-in-frequency". If the output rather than
the input map is used to derive the FFT algorithm so the correct 
output order is obtained, the input order must be scrambled so that 
its values are in locations specified by the output map rather than 
the input map. This algorithm is called "decimation-in-time". The 
scrambling is the same bit-reverse counting as before, but it pre- 
cedes the FFT algorithm in this case. The same process of a post- 
unscrambler or pre-scrambler occurs for the in-place calculations 
with the type-one maps. Details can be found in [60], [56]. It is 
possible to do the unscrambling while calculating the FFT and to 
avoid a separate unscrambler. This is done for the Cooley-Tukey 
FFT in [192] and for the PFA in [60], [56], [319]. 

If a radix-2 FFT is used, the unscrambler is a bit-reversed counter.
If a radix-4 FFT is used, the unscrambler is a base-4 reversed
counter, and similarly for radix-8 and others. However, if for the
radix-4 FFT, the short length-4 DFTs (butterflies) have their outputs
in bit-reversed order, the output of the total radix-4 FFT will
be in bit-reversed order, not base-4 reversed order. This means any
radix-$2^n$ FFT can use the same radix-2 bit-reversed counter as an
unscrambler if the proper butterflies are used.

3.3 Efficiencies Resulting from Index Map- 
ping with the DFT 

In this section the reductions in arithmetic in the DFT that result 
from the index mapping alone will be examined. In practical al- 
gorithms several methods are always combined, but it is helpful in 
understanding the effects of a particular method to study it alone. 

The most general form of an uncoupled two-dimensional DFT is
given by

$$X(k_1, k_2) = \sum_{n_2=0}^{N_2-1} \Big\{ \sum_{n_1=0}^{N_1-1} x(n_1, n_2)\, f_1(n_1, n_2, k_1) \Big\}\, f_2(n_2, k_1, k_2) \tag{3.28}$$

where the inner sum calculates $N_2$ length-$N_1$ DFT's and, for a
type-two map, the effects of the TFs. If the number of arithmetic
operations for a length-N DFT is denoted by F(N), the number
of operations for this inner sum is $F = N_2 F(N_1)$. The outer sum,
which gives $N_1$ length-$N_2$ DFT's, requires $N_1 F(N_2)$ operations.
The total number of arithmetic operations is then

$$F = N_2 F(N_1) + N_1 F(N_2) \tag{3.29}$$

The first question to be considered is, for a fixed length N, what
is the optimal relation of $N_1$ and $N_2$ in the sense of minimizing the
required amount of arithmetic. To answer this question, $N_1$ and $N_2$
are temporarily assumed to be real variables rather than integers. If
the short length-$N_i$ DFT's in (3.28) and any TF multiplications are
assumed to require $N_i^2$ operations, i.e. $F(N_i) = N_i^2$, (3.29)
becomes

$$F = N_2 N_1^2 + N_1 N_2^2 = N(N_1 + N_2) = N(N_1 + N N_1^{-1}) \tag{3.30}$$

To find the minimum of F over $N_1$, the derivative of F with respect
to $N_1$ is set to zero (temporarily assuming the variables to be
continuous) and the result requires $N_1 = N_2$.

$$dF/dN_1 = 0 \;\Rightarrow\; N_1 = N_2 \tag{3.31}$$

This result is also easily seen from the symmetry of $N_1$ and $N_2$
in $N = N_1 N_2$. If a more general model of the arithmetic complexity
of the short DFT's is used, the same result is obtained, but a
closer examination must be made to assure that $N_1 = N_2$ is a global
minimum.
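
For integer factors the same conclusion can be seen by listing every factorization of a particular length. The Python sketch below uses N = 36 purely as an illustrative choice and evaluates the cost model of (3.30) for each factor pair; the names `mapped_cost`, `pairs`, `costs`, and `best` are assumptions of this example.

```python
def mapped_cost(N1, N2):
    """Operation count F = N2*N1^2 + N1*N2^2 from (3.30)."""
    return N2 * N1 ** 2 + N1 * N2 ** 2

N = 36
# Every ordered factorization N = N1 * N2
pairs = [(N1, N // N1) for N1 in range(1, N + 1) if N % N1 == 0]
costs = {pair: mapped_cost(*pair) for pair in pairs}
best = min(costs, key=costs.get)

for pair in pairs:
    print(pair, costs[pair])
print("smallest count at", best)
```

The smallest count occurs at the pair with equal factors, and the counts grow as the two factors become more unequal, in line with (3.31).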

If only the effects of the index mapping are to be considered, then
the $F(N) = N^2$ model is used and (3.31) states that the two factors
should be equal. If there are M factors, a similar reasoning shows
that all M factors should be equal. For the sequence of length

$$N = R^M \tag{3.32}$$

the index map now has M dimensions, each transformed by length-R
DFT's, and, since the factors are all equal, the index map must be
type two. This means there must be twiddle factors.

In order to simplify the analysis, only the number of multiplications
will be considered. If the number of multiplications for a
length-R DFT is F(R), then the formula for operation counts in
(3.29) generalizes to

$$F = N \sum_{i=1}^{M} F(N_i)/N_i = N M F(R)/R \tag{3.33}$$

for $N_i = R$, which gives

$$F = N \log_R(N)\, F(R)/R = (N \ln N)\, (F(R)/(R \ln R)) \tag{3.34}$$

This is a very important formula which was derived by Cooley
and Tukey in their famous paper [89] on the FFT. It states that for
a given R, which is called the radix, the number of multiplications
(and additions) is proportional to $N \ln N$. It also shows the relation
to the value of the radix, R.

In order to get some idea of the "best" radix, the number of
multiplications to compute a length-R DFT is assumed to be $F(R) = R^x$.
If this is used with (3.34), the optimal R can be found.

$$dF/dR = 0 \;\Rightarrow\; R = e^{1/(x-1)} \tag{3.35}$$

For x = 2 this gives R = e, with the closest integer being three.

The result of this analysis states that if no arithmetic saving
methods other than index mapping are used, and if the length-R
DFT's plus TFs require $F = R^2$ multiplications, the optimal
algorithm requires

$$F = 3N \log_3 N \tag{3.36}$$

multiplications for a length $N = 3^M$ DFT. Compare this with $N^2$
for a direct calculation and the improvement is obvious.
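
The dependence on the radix in (3.34) can be made concrete by evaluating the factor $F(R)/(R \ln R)$ for a few integer radices. The Python sketch below assumes the model $F(R) = R^x$ with x = 2, as in the text; the function name `radix_coefficient` is an illustrative assumption.

```python
import math

def radix_coefficient(R, x=2.0):
    """The factor F(R)/(R ln R) of (3.34) under the model F(R) = R**x.
    The multiplication count is this factor times N ln N."""
    return R ** (x - 1) / math.log(R)

for R in range(2, 9):
    print(R, round(radix_coefficient(R), 4))

best = min(range(2, 9), key=radix_coefficient)
print("smallest factor at R =", best)
```

Under this model radix 3 gives the smallest factor, and radix 2 and radix 4 give identical factors since $4/\ln 4 = 2/\ln 2$. The differences among small radices are modest, which is one reason the choice in practice is governed by the special short DFT algorithms discussed next rather than by this model alone.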

While this is an interesting result from the analysis of the effects of
index mapping alone, in practice, index mapping is almost always
used in conjunction with special algorithms for the short length-$N_i$
DFT's in (3.15). For example, if R = 2 or 4, there are no
multiplications required for the short DFT's. Only the TFs require
multiplications. Winograd (see Winograd's Short DFT Algorithms
(Chapter 7)) has derived some algorithms for short DFT's that
require O(N) multiplications. This means that $F(N_i) = K N_i$ and
the operation count F in (3.33) is independent of $N_i$. Therefore, the
derivative of F is zero for all $N_i$. Obviously, these particular cases
must be examined.

3.4 The FFT as a Recursive Evaluation of 
the DFT 

It is possible to formulate the DFT so a length-N DFT can be
calculated in terms of two length-(N/2) DFTs. And, if $N = 2^M$, each
of those length-(N/2) DFTs can be found in terms of length-(N/4)
DFTs. This allows the DFT to be calculated by a recursive algorithm
with M recursions, giving the familiar order N log(N) arithmetic
complexity.

Calculate the even indexed DFT values from (3.1) by:

$$C(2k) = \sum_{n=0}^{N-1} x(n)\, W_N^{2nk} = \sum_{n=0}^{N-1} x(n)\, W_{N/2}^{nk} \tag{3.37}$$

$$C(2k) = \sum_{n=0}^{N/2-1} x(n)\, W_N^{2nk} + \sum_{n=N/2}^{N-1} x(n)\, W_{N/2}^{nk} \tag{3.38}$$

$$C(2k) = \sum_{n=0}^{N/2-1} \{x(n) + x(n+N/2)\}\, W_{N/2}^{nk} \tag{3.39}$$

and a similar argument gives the odd indexed values as:

$$C(2k+1) = \sum_{n=0}^{N/2-1} \{x(n) - x(n+N/2)\}\, W_N^{n}\, W_{N/2}^{nk} \tag{3.40}$$

Together, these are recursive DFT formulas expressing the length-N
DFT of x(n) in terms of length-N/2 DFTs:

$$C(2k) = \mathrm{DFT}_{N/2}\{x(n) + x(n+N/2)\} \tag{3.41}$$

$$C(2k+1) = \mathrm{DFT}_{N/2}\{[x(n) - x(n+N/2)]\, W_N^{n}\} \tag{3.42}$$

This is a "decimation-in-frequency" (DIF) version since it gives
samples of the frequency domain representation in terms of blocks
of the time domain signal.

A recursive Matlab program which implements this is given by:

function c = dftr2(x)
% Recursive Decimation-in-Frequency FFT algorithm, csb 8/21/0
% x must be a row vector whose length is a power of two.
L = length(x);
if L > 1
    L2 = L/2;
    TF = exp(-j*2*pi/L).^[0:L2-1];          % twiddle factors W_L^n
    c1 = dftr2( x(1:L2) + x(L2+1:L));       % even indexed values, (3.41)
    c2 = dftr2((x(1:L2) - x(L2+1:L)).*TF);  % odd indexed values, (3.42)
    cc = [c1.';c2.'];                       % .' transposes without conjugating
    c  = cc(:);                             % interleave even and odd values
else
    c = x;
end



Listing 3.1: DIF Recursive FFT for N = 2^M



A DIT version can be derived in the form:

$$C(k) = \mathrm{DFT}_{N/2}\{x(2n)\} + W_N^{k}\, \mathrm{DFT}_{N/2}\{x(2n+1)\} \tag{3.43}$$

$$C(k+N/2) = \mathrm{DFT}_{N/2}\{x(2n)\} - W_N^{k}\, \mathrm{DFT}_{N/2}\{x(2n+1)\} \tag{3.44}$$

which gives blocks of the frequency domain from samples of the
signal.

A recursive Matlab program which implements this is given by:

function c = dftr(x)
% Recursive Decimation-in-Time FFT algorithm, csb
% x must be a row vector whose length L is a power of two.
L = length(x);
if L > 1
   L2 = L/2;
   ce = dftr(x(1:2:L-1));            % DFT of the even indexed samples x(2n)
   co = dftr(x(2:2:L));              % DFT of the odd indexed samples x(2n+1)
   TF = exp(-j*2*pi/L).^[0:L2-1];    % twiddle factors W_L^k
   c1 = TF.*co;
   c  = [(ce+c1), (ce-c1)];          % C(k) and C(k+L/2)
else
   c = x;
end

Listing 3.2: DIT Recursive FFT for N = 2^M

Similar recursive expressions can be developed for other radices
and algorithms. Most recursive programs do not execute as
efficiently as looped or straight code, but some can be very
efficient, e.g. parts of the FFTW.

Note a length-$2^M$ sequence will require M recursions, each of
which will require N/2 multiplications. This gives the $N \log(N)$
formula that the other approaches also derive.
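
The recursion in (3.41) and (3.42) can also be checked numerically. The following is a minimal Python sketch using only the standard library; it is not one of the book's programs, and the names `dft_direct` and `fft_dif` are chosen here purely for illustration. It follows the same steps as the DIF listing and compares the result with the direct definition of the DFT.

```python
import cmath

def dft_direct(x):
    """Length-N DFT computed straight from the definition; O(N^2) operations."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def fft_dif(x):
    """Recursive radix-2 DIF FFT following (3.41) and (3.42); len(x) must be 2**M."""
    N = len(x)
    if N == 1:
        return list(x)
    half = N // 2
    # Sum and (twiddled) difference of the two halves of the block.
    sums = [x[n] + x[n + half] for n in range(half)]
    diffs = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
             for n in range(half)]
    out = [0j] * N
    out[0::2] = fft_dif(sums)    # even indexed values C(2k)
    out[1::2] = fft_dif(diffs)   # odd indexed values C(2k+1)
    return out

x = [1, 2, 3, 4, 0, -1, 2.5, 7]
error = max(abs(a - b) for a, b in zip(fft_dif(x), dft_direct(x)))
print("largest difference from the direct DFT:", error)
```

The printed difference should be at the level of floating-point round-off.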



Chapter 4 

Polynomial Description of 
Signals 1 



Polynomials are important in digital signal processing because cal- 
culating the DFT can be viewed as a polynomial evaluation prob- 
lem and convolution can be viewed as polynomial multiplication 
[27], [261]. Indeed, this is the basis for the important results of 
Winograd discussed in Winograd's Short DFT Algorithms (Chapter 7).
A length-N signal x(n) will be represented by an $N-1$
degree polynomial X(s) defined by

$$X(s) = \sum_{n=0}^{N-1} x(n)\, s^{n} \tag{4.1}$$

This polynomial X (s) is a single entity with the coefficients being 
the values of x(n). It is somewhat similar to the use of matrix or 
vector notation to efficiently represent signals which allows use of 
new mathematical tools. 

1 This content is available online at <http://cnx.org/content/m16327/1.8/>.

The convolution of two finite length sequences, x(n) and h(n),
gives an output sequence defined by

$$y(n) = \sum_{k=0}^{N-1} x(k)\, h(n-k) \tag{4.2}$$

for $n = 0, 1, 2, \cdots, 2N-2$, where $h(k) = 0$ for $k < 0$ and for
$k > N-1$. This is exactly the same operation as calculating the
coefficients when multiplying two polynomials. Equation (4.2) is
the same as

$$Y(s) = X(s)\, H(s) \tag{4.3}$$

In fact, convolution of number sequences, multiplication of
polynomials, and the multiplication of integers (except for the carry
operation) are all the same operations. To obtain cyclic
convolution, where the indices in (4.2) are all evaluated modulo N, the
polynomial multiplication in (4.3) is done modulo the polynomial
$P(s) = s^N - 1$. This is seen by noting that $N = 0 \bmod N$; therefore,
$s^N = 1$ and the polynomial modulus is $s^N - 1$.
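
The equivalence between convolution and polynomial multiplication can be illustrated with a short sketch. The Python code below is only an illustration under the stated definitions; the function names are hypothetical. It multiplies two coefficient lists, reduces the product modulo $s^N - 1$ by folding index n onto n mod N, and compares the result with cyclic convolution computed directly from (4.2) with modular indices.

```python
def poly_mul(x, h):
    """Coefficients of X(s)H(s), low order first: the linear convolution (4.2)."""
    y = [0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for k, hk in enumerate(h):
            y[i + k] += xi * hk
    return y

def reduce_mod_sN_minus_1(y, N):
    """Residue modulo s^N - 1: s^N is replaced by 1, so index n folds onto n mod N."""
    r = [0] * N
    for n, yn in enumerate(y):
        r[n % N] += yn
    return r

def cyclic_conv(x, h):
    """Cyclic convolution from (4.2) with all indices evaluated modulo N."""
    N = len(x)
    return [sum(x[k] * h[(n - k) % N] for k in range(N)) for n in range(N)]

x = [1, 2, 3, 4]
h = [1, 0, -1, 2]
linear = poly_mul(x, h)
print("linear convolution:", linear)
print("reduced modulo s^4 - 1:", reduce_mod_sN_minus_1(linear, 4))
print("direct cyclic convolution:", cyclic_conv(x, h))
```

The last two printed lists agree, which is the statement that cyclic convolution is polynomial multiplication modulo $s^N - 1$.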



4.1 Polynomial Reduction and the Chinese 
Remainder Theorem 

Residue reduction of one polynomial modulo another is defined
similarly to residue reduction for integers. A polynomial F(s) has
a residue polynomial R(s) modulo P(s) if, for a given F(s) and
P(s), a Q(s) and R(s) exist such that

$$F(s) = Q(s)\, P(s) + R(s) \tag{4.4}$$

with degree{R(s)} < degree{P(s)}. The notation that will be
used is

$$R(s) = ((F(s)))_{P(s)} \tag{4.5}$$

For example,

$$(s+1) = ((s^4 + s^3))_{(s^2-1)} \tag{4.6}$$

because $s^4 + s^3 = (s^2 + s + 1)(s^2 - 1) + (s + 1)$; equivalently,
$s^4 + s^3 - s - 1$ is exactly divisible by $s^2 - 1$.

The concepts of factoring a polynomial and of primeness are an
extension of these ideas for integers. For a given allowed set of
coefficients (values of x(n)), any polynomial has a unique factored
representation

$$F(s) = \prod_{i=1}^{M} F_i(s)^{k_i} \tag{4.7}$$

where the $F_i(s)$ are relatively prime. This is analogous to the
fundamental theorem of arithmetic.

There is a very useful operation that is an extension of the integer
Chinese Remainder Theorem (CRT) which says that if the modulus
polynomial can be factored into relatively prime factors

$$P(s) = P_1(s)\, P_2(s) \tag{4.8}$$

then there exist two polynomials, $K_1(s)$ and $K_2(s)$, such that any
polynomial F(s) can be recovered from its residues by

$$F(s) = K_1(s)\, F_1(s) + K_2(s)\, F_2(s) \bmod P(s) \tag{4.9}$$

where $F_1$ and $F_2$ are the residues given by

$$F_1(s) = ((F(s)))_{P_1(s)} \tag{4.10}$$

and

$$F_2(s) = ((F(s)))_{P_2(s)} \tag{4.11}$$

if the degree of F(s) is less than the degree of P(s). This
generalizes to any number of relatively prime factors of P(s) and can
be viewed as a means of representing F(s) by several lower degree
polynomials, $F_i(s)$.

This decomposition of F(s) into lower degree polynomials is the
process used to break a DFT or convolution into several simple
problems which are solved and then recombined using the CRT of
(4.9). This is another form of the "divide and conquer" or "organize
and share" approach similar to the index mappings in
Multidimensional Index Mapping (Chapter 3).

One useful property of the CRT is for convolution. If cyclic
convolution of x(n) and h(n) is expressed in terms of polynomials by

$$Y(s) = H(s)\, X(s) \bmod P(s) \tag{4.12}$$

where $P(s) = s^N - 1$, and if P(s) is factored into two relatively
prime factors $P = P_1 P_2$, using residue reduction of H(s) and X(s)
modulo $P_1$ and $P_2$, the lower degree residue polynomials can be
multiplied and the results recombined with the CRT. This is done
by

$$Y(s) = ((K_1 H_1 X_1 + K_2 H_2 X_2))_{P} \tag{4.13}$$

where

$$H_1 = ((H))_{P_1}, \quad X_1 = ((X))_{P_1}, \quad H_2 = ((H))_{P_2}, \quad X_2 = ((X))_{P_2} \tag{4.14}$$

and $K_1$ and $K_2$ are the CRT coefficient polynomials from (4.9).
This allows two shorter convolutions to replace one longer one.
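
As an illustration of (4.13), consider the particular factorization $s^N - 1 = (s^{N/2} - 1)(s^{N/2} + 1)$ for even N. For this choice one can verify that $K_1(s) = (s^{N/2} + 1)/2$ and $K_2(s) = (1 - s^{N/2})/2$ satisfy the CRT conditions. The Python sketch below assumes that factorization; the function names are hypothetical and the code is a sketch rather than an efficient algorithm. It replaces one length-N cyclic convolution by two length-N/2 products, one modulo $s^{N/2} - 1$ and one modulo $s^{N/2} + 1$.

```python
def mul_mod(a, b, sign):
    """Product of two length-m polynomials modulo s^m - sign, with sign = +1 or -1."""
    m = len(a)
    r = [0] * m
    for i in range(m):
        for k in range(m):
            if i + k < m:
                r[i + k] += a[i] * b[k]
            else:
                r[i + k - m] += sign * a[i] * b[k]   # s^m is replaced by sign
    return r

def cyclic_conv_crt(x, h):
    """Length-N cyclic convolution (N even) from two half-length residue products."""
    m = len(x) // 2
    # Residues modulo P1 = s^m - 1 (s^m -> 1) and P2 = s^m + 1 (s^m -> -1).
    x1 = [x[i] + x[i + m] for i in range(m)]
    x2 = [x[i] - x[i + m] for i in range(m)]
    h1 = [h[i] + h[i + m] for i in range(m)]
    h2 = [h[i] - h[i + m] for i in range(m)]
    y1 = mul_mod(x1, h1, +1)
    y2 = mul_mod(x2, h2, -1)
    # CRT recombination with K1 = (s^m + 1)/2 and K2 = (1 - s^m)/2.
    low = [(a + b) / 2 for a, b in zip(y1, y2)]
    high = [(a - b) / 2 for a, b in zip(y1, y2)]
    return low + high

def cyclic_conv(x, h):
    """Reference: cyclic convolution directly from the definition."""
    N = len(x)
    return [sum(x[k] * h[(n - k) % N] for k in range(N)) for n in range(N)]

x = [1, 2, 3, 4]
h = [1, 0, -1, 2]
print(cyclic_conv_crt(x, h), cyclic_conv(x, h))
```

Both printed lists contain the same values, the CRT version as floating-point numbers because of the division by two.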

Another property of residue reduction that is useful in DFT
calculation is polynomial evaluation. To evaluate F(s) at $s = x$, F(s) is
reduced modulo $s - x$.

$$F(x) = ((F(s)))_{s-x} \tag{4.15}$$

This is easily seen from the definition in (4.4)

$$F(s) = Q(s)(s - x) + R(s) \tag{4.16}$$

Evaluating at $s = x$ gives $R(s) = F(x)$, which is a constant. For the
DFT this becomes

$$C(k) = ((X(s)))_{s-W^k} \tag{4.17}$$

Details of the polynomial algebra useful in digital signal
processing can be found in [27], [233], [261].
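
The relation (4.15) can be made concrete with synthetic division. The sketch below is a hypothetical Python helper, not taken from the references. It divides F(s) by $(s - x)$, returns the quotient and the constant remainder, and then uses the remainder modulo $(s - W^k)$ to obtain DFT values as in (4.17).

```python
import cmath

def divide_by_linear(f, x):
    """Divide F(s), coefficients low order first, by (s - x); return (Q, R)."""
    q = [0] * (len(f) - 1)
    carry = 0
    for i in range(len(f) - 1, 0, -1):
        carry = f[i] + x * carry     # next quotient coefficient
        q[i - 1] = carry
    remainder = f[0] + x * carry     # constant residue, equal to F(x)
    return q, remainder

def dft_by_residues(x):
    """C(k) as the residue of X(s) modulo (s - W^k), one k at a time."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    return [divide_by_linear(x, W ** k)[1] for k in range(N)]

# F(s) = s^4 + s^3 - s - 1 evaluated at s = 2 through its residue modulo (s - 2).
quotient, value = divide_by_linear([-1, -1, 0, 1, 1], 2)
print(quotient, value)
print(dft_by_residues([1, 2, 3, 4]))
```

Each call costs about N multiplications, so computing all N values this way still costs on the order of $N^2$ operations.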

4.2 The DFT as a Polynomial Evaluation 

The Z-transform of a number sequence x(n) is defined as

$$X(z) = \sum_{n=0}^{\infty} x(n)\, z^{-n} \tag{4.18}$$

which is the same as the polynomial description in (4.1) but with a
negative exponent. For a finite length-N sequence (4.18) becomes

$$X(z) = \sum_{n=0}^{N-1} x(n)\, z^{-n} \tag{4.19}$$

$$X(z) = x(0) + x(1) z^{-1} + x(2) z^{-2} + \cdots + x(N-1) z^{-N+1} \tag{4.20}$$

This $N-1$ order polynomial takes on the values of the DFT of
x(n) when evaluated at

$$z = e^{j2\pi k/N} \tag{4.21}$$

which gives

$$C(k) = X(z)\big|_{z=e^{j2\pi k/N}} = \sum_{n=0}^{N-1} x(n)\, e^{-j2\pi nk/N} \tag{4.22}$$

In terms of the positive exponent polynomial from (4.1), the DFT
is

$$C(k) = X(s)\big|_{s=W^k} \tag{4.23}$$

where

$$W = e^{-j2\pi/N} \tag{4.24}$$

is an $N$th root of unity (raising W to the $N$th power gives one). The
N values of the DFT are found from X(s) evaluated at the N $N$th
roots of unity which are equally spaced around the unit circle in
the complex s plane.

One method of evaluating X(z) is the so-called Horner's rule or
nested evaluation. When expressed as a recursive calculation,
Horner's rule becomes the Goertzel algorithm which has some
computational advantages especially when only a few values of
the DFT are needed. The details and programs can be found in
[272], [61] and The DFT as Convolution or Filtering: Goertzel's
Algorithm (or A Better DFT Algorithm) (Section 5.3).

Another method for evaluating X(s) is the residue reduction
modulo $(s - W^k)$ as shown in (4.17). Each evaluation requires N
multiplications and therefore, $N^2$ multiplications for the N values of
C(k).

$$C(k) = ((X(s)))_{(s-W^k)} \tag{4.25}$$

A considerable reduction in required arithmetic can be achieved if
some operations can be shared between the reductions for different
values of k. This is done by carrying out the residue reduction in
stages that can be shared rather than done in one step for each k in
(4.25).

The N values of the DFT are values of X(s) evaluated at s equal to
the N roots of the polynomial $P(s) = s^N - 1$ which are $W^k$. First,
assuming N is even, factor P(s) as

$$P(s) = (s^N - 1) = P_1(s)\, P_2(s) = (s^{N/2} - 1)(s^{N/2} + 1) \tag{4.26}$$

X(s) is reduced modulo these two factors to give two residue
polynomials, $X_1(s)$ and $X_2(s)$. This process is repeated by factoring $P_1$
and further reducing $X_1$, then factoring $P_2$ and reducing $X_2$. This
is continued until the factors are of first degree which gives the
desired DFT values as in (4.25). This is illustrated for a length-8
DFT. The polynomial whose roots are $W^k$ factors as

$$P(s) = s^8 - 1 \tag{4.27}$$

$$= [s^4 - 1][s^4 + 1] \tag{4.28}$$

$$= [(s^2 - 1)(s^2 + 1)][(s^2 - j)(s^2 + j)] \tag{4.29}$$

$$= [(s-1)(s+1)(s-j)(s+j)][(s-a)(s+a)(s-ja)(s+ja)] \tag{4.30}$$

where $a^2 = j$. Reducing X(s) by the first factoring gives two third
degree polynomials. Writing $x_n$ for x(n), the polynomial

$$X(s) = x_0 + x_1 s + x_2 s^2 + \cdots + x_7 s^7 \tag{4.31}$$

gives the residue polynomials

$$X_1(s) = ((X(s)))_{(s^4-1)} = (x_0 + x_4) + (x_1 + x_5) s + (x_2 + x_6) s^2 + (x_3 + x_7) s^3 \tag{4.32}$$

$$X_2(s) = ((X(s)))_{(s^4+1)} = (x_0 - x_4) + (x_1 - x_5) s + (x_2 - x_6) s^2 + (x_3 - x_7) s^3 \tag{4.33}$$

Two more levels of reduction are carried out to finally give the
DFT. Close examination shows the resulting algorithm to be the
decimation-in-frequency radix-2 Cooley-Tukey FFT [272], [61].
Martens [227] has used this approach to derive an efficient DFT
algorithm.

Other algorithms and types of FFT can be developed using polyno- 
mial representations and some are presented in the generalization 
in DFT and FFT: An Algebraic View (Chapter 8). 
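
The staged residue reduction described above can be sketched recursively. The Python code below is an illustrative sketch with hypothetical function names, not the algorithm of [227]. At each stage it uses the factorization $s^{n} - c = (s^{n/2} - d)(s^{n/2} + d)$ with $d^2 = c$, reduces the current residue polynomial modulo both factors, and stops when the factors are of first degree. The length must be a power of two.

```python
import cmath
import math

def reduce_tree(coeffs, c):
    """Residues of a polynomial (low order first, length n) modulo (s - r)
    for every root r of s^n = c, returned as (r, residue) pairs."""
    n = len(coeffs)
    if n == 1:
        return [(c, coeffs[0])]          # modulus is s - c, residue is the constant
    m = n // 2
    d = cmath.sqrt(c)                     # s^n - c = (s^m - d)(s^m + d)
    low = [coeffs[i] + d * coeffs[i + m] for i in range(m)]    # s^m replaced by d
    high = [coeffs[i] - d * coeffs[i + m] for i in range(m)]   # s^m replaced by -d
    return reduce_tree(low, d) + reduce_tree(high, -d)

def dft_by_reduction(x):
    """DFT obtained by reducing X(s) modulo the factors of s^N - 1 in stages."""
    N = len(x)
    out = [0j] * N
    for root, value in reduce_tree(list(x), 1 + 0j):
        # Each root is W^k = exp(-j 2 pi k / N); recover k from its angle.
        k = round(-cmath.phase(root) * N / (2 * math.pi)) % N
        out[k] = value
    return out

def dft_direct(x):
    """Reference DFT from the definition."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

x = [1, 2, 3, 4, 0, -1, 2.5, 7]
print(max(abs(a - b) for a, b in zip(dft_by_reduction(x), dft_direct(x))))
```

The first stage of this recursion for N = 8 computes exactly the sums and differences in (4.32) and (4.33).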



Chapter 5 

The DFT as Convolution or 
Filtering 



A major application of the FFT is fast convolution or fast filtering
where the DFT of the signal is multiplied term-by-term by the DFT
of the impulse response (this is most helpful when doing finite
impulse response (FIR) filtering) and the time-domain output is
obtained by taking the inverse DFT of that product. What is less
well-known is that the DFT can be calculated by convolution. There
are several different approaches to this, each with different
applications.

5.1 Rader's Conversion of the DFT into 
Convolution 

In this section a method quite different from the index mapping or
polynomial evaluation is developed. Rather than dealing with the
DFT directly, it is converted into a cyclic convolution which must
then be carried out by some efficient means. Those means will
be covered later, but here the conversion will be explained. This
method requires use of some number theory, which can be found
in an accessible form in [234] or [262] and is easy enough to verify
on one's own. A good general reference on number theory is [259].

The DFT and cyclic convolution are defined by

$$C(k) = \sum_{n=0}^{N-1} x(n)\, W^{nk} \tag{5.1}$$

$$y(k) = \sum_{n=0}^{N-1} x(n)\, h(k-n) \tag{5.2}$$

For both, the indices are evaluated modulo N. In order to convert
the DFT in (5.1) into the cyclic convolution of (5.2), the nk
product must be changed to the $k - n$ difference. With real numbers,
this can be done with logarithms, but it is more complicated when
working in a finite set of integers modulo N. From number theory
[28], [234], [262], [259], it can be shown that if the modulus is a
prime number, a base (called a primitive root) exists such that a
form of integer logarithm can be defined. This is stated in the
following way. If N is a prime number, a number r called a primitive
root exists such that the integer equation

$$n = ((r^m))_N \tag{5.3}$$

creates a unique, one-to-one map of the $N-1$ member set
$m = \{0, ..., N-2\}$ and the $N-1$ member set $n = \{1, ..., N-1\}$. This is
because the multiplicative group of integers modulo a prime, p, is
isomorphic to the additive group of integers modulo $(p-1)$ and is
illustrated for $N = 5$ below.

         m = 0   1   2   3   4   5   6   7
r = 1        1   1   1   1   1   1   1   1
r = 2        1   2   4   3   1   2   4   3
r = 3        1   3   4   2   1   3   4   2
r = 4        1   4   1   4   1   4   1   4
r = 5        *   0   0   0   *   0   0   0
r = 6        1   1   1   1   1   1   1   1

Table 5.1: Table of Integers $n = ((r^m))$ modulo 5, [* not defined]

Table 5.1 is an array of values of $r^m$ modulo N and it is easy to
see that there are two primitive roots, 2 and 3, and (5.3) defines
a permutation of the integers n from the integers m (except for
zero). (5.3) and a primitive root (usually chosen to be the smallest
of those that exist) can be used to convert the DFT in (5.1) to the
convolution in (5.2). Since (5.3) cannot give a zero, a new
length-(N-1) data sequence is defined from x(n) by removing the term
with index zero. Let

$$n = r^{-m} \tag{5.4}$$

and

$$k = r^{s} \tag{5.5}$$

where the term with the negative exponent (the inverse) is defined
as the integer that satisfies

$$((r^{-m}\, r^{m}))_N = 1 \tag{5.6}$$

If N is a prime number, $r^{-m}$ always exists. For example,
$((2^{-1}))_5 = 3$. (5.1) now becomes

$$C(r^s) = \sum_{m=0}^{N-2} x(r^{-m})\, W^{r^{-m} r^{s}} + x(0), \tag{5.7}$$

for $s = 0, 1, .., N-2$, and

$$C(0) = \sum_{n=0}^{N-1} x(n) \tag{5.8}$$

New functions are defined, which are simply a permutation in the
order of the original functions, as

$$x'(m) = x(r^{-m}), \quad C'(s) = C(r^{s}), \quad W'(n) = W^{r^{n}} \tag{5.9}$$

(5.7) then becomes

$$C'(s) = \sum_{m=0}^{N-2} x'(m)\, W'(s-m) + x(0) \tag{5.10}$$

which is cyclic convolution of length N-1 (plus x(0)) and is
denoted as

$$C'(k) = x'(k) * W'(k) + x(0) \tag{5.11}$$

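Applying this change of variables (use of logarithms) to the DFT
can best be illustrated from the matrix formulation of the DFT.
(5.1) is written for a length-5 DFT as

$$\begin{bmatrix} C(0) \\ C(1) \\ C(2) \\ C(3) \\ C(4) \end{bmatrix} =
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 \\
0 & 1 & 2 & 3 & 4 \\
0 & 2 & 4 & 1 & 3 \\
0 & 3 & 1 & 4 & 2 \\
0 & 4 & 3 & 2 & 1
\end{bmatrix}
\begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \end{bmatrix} \tag{5.12}$$

where the square matrix should contain the terms of $W^{nk}$ but for
clarity, only the exponents nk (reduced modulo 5) are shown.
Separating the x(0) term, applying the mapping of (5.9), and using the
primitive root $r = 2$ (and $r^{-1} = 3$) gives

$$\begin{bmatrix} C(1) \\ C(2) \\ C(4) \\ C(3) \end{bmatrix} =
\begin{bmatrix}
1 & 3 & 4 & 2 \\
2 & 1 & 3 & 4 \\
4 & 2 & 1 & 3 \\
3 & 4 & 2 & 1
\end{bmatrix}
\begin{bmatrix} x(1) \\ x(3) \\ x(4) \\ x(2) \end{bmatrix} +
\begin{bmatrix} x(0) \\ x(0) \\ x(0) \\ x(0) \end{bmatrix} \tag{5.13}$$

and

$$C(0) = x(0) + x(1) + x(2) + x(3) + x(4) \tag{5.14}$$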



which can be seen to be a reordering of the structure in (5.12). 
This is in the form of cyclic convolution as indicated in (5.10). 
Rader first showed this in 1968 [234], stating that a prime length- 
N DFT could be converted into a length-(N-l) cyclic convolution 
of a permutation of the data with a permutation of the W's. He 
also stated that a slightly more complicated version of the same 
idea would work for a DFT with a length equal to an odd prime to 
a power. The details of that theory can be found in [234], [169]. 
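
The conversion in (5.7) through (5.11) can be sketched directly. The Python code below is an illustration under the assumption that the length N is an odd prime. The function names are hypothetical, the primitive root is found by brute force, and the cyclic convolution is carried out by a plain double loop rather than by any fast algorithm.

```python
import cmath

def primitive_root(p):
    """Smallest primitive root of the odd prime p, found by brute force."""
    for r in range(2, p):
        if len({pow(r, m, p) for m in range(p - 1)}) == p - 1:
            return r
    raise ValueError("no primitive root found; p must be an odd prime")

def rader_dft(x):
    """Prime-length DFT as a length-(N-1) cyclic convolution, following (5.10)."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    r = primitive_root(N)
    r_inv = pow(r, N - 2, N)          # inverse of r modulo the prime N
    L = N - 1
    x_perm = [x[pow(r_inv, m, N)] for m in range(L)]   # x'(m) = x(r^-m)
    w_perm = [W ** pow(r, n, N) for n in range(L)]     # W'(n) = W^(r^n)
    C = [0j] * N
    C[0] = sum(x)                                      # (5.8)
    for s in range(L):
        conv = sum(x_perm[m] * w_perm[(s - m) % L] for m in range(L))
        C[pow(r, s, N)] = conv + x[0]                  # C'(s) = C(r^s)
    return C

def dft_direct(x):
    """Reference DFT from the definition (5.1)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

x = [1, 2, 0, -1, 3]
print(primitive_root(5))
print(max(abs(a - b) for a, b in zip(rader_dft(x), dft_direct(x))))
```

For N = 5 the permutations produced by this sketch are the orderings C(1), C(2), C(4), C(3) and x(1), x(3), x(4), x(2) that appear in (5.13).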

Until 1976, this conversion approach received little attention since 
it seemed to offer few advantages. It has specialized applications in 
calculating the DFT if the cyclic convolution is done by distributed 
arithmetic table look-up [77] or by use of number theoretic trans- 
forms [28], [234], [262]. It and the Goertzel algorithm [273], [62] 
are efficient when only a few DFT values need to be calculated. 
It may also have advantages when used with pipelined or vector 
hardware designed for fast inner products. One example is the 
TMS320 signal processing microprocessor which is pipelined for 
inner products. The general use of this scheme emerged when new 
fast cyclic convolution algorithms were developed by Winograd
[405].

5.2 The Chirp Z-Transform (or Bluestein's
Algorithm)

The DFT of x(n) evaluates the Z-transform of x(n) on N equally
spaced points on the unit circle in the z plane. Using a nonlinear
change of variables, one can create a structure which is equivalent
to modulation and filtering x(n) by a "chirp" signal [34], [306],
[298], [273], [304], [62].

The mathematical identity $(k-n)^2 = k^2 - 2kn + n^2$ gives

$$nk = \left(n^2 - (k-n)^2 + k^2\right)/2 \tag{5.15}$$

which substituted into the definition of the DFT in
Multidimensional Index Mapping: Equation 1 (3.1) gives

$$C(k) = \left\{ \sum_{n=0}^{N-1} \left[ x(n)\, W^{n^2/2} \right] W^{-(k-n)^2/2} \right\} W^{k^2/2} \tag{5.16}$$

This equation can be interpreted as first multiplying (modulating)
the data x(n) by a chirp sequence $W^{n^2/2}$, then convolving
(filtering) it, then finally multiplying the filter output by the chirp
sequence to give the DFT.

Define the chirp sequence or signal as $h(n) = W^{n^2/2}$ which is called
a chirp because the squared exponent gives a sinusoid with
changing frequency. Using this definition, (5.16) becomes

$$C(n) = \left\{ [x(n)\, h(n)] * h^{-1}(n) \right\} h(n) \tag{5.17}$$

We know that convolution can be carried out by multiplying the
DFTs of the signals; here we see that evaluation of the DFT can
be carried out by convolution. Indeed, the convolution represented
by * in (5.17) can be carried out by DFTs (actually FFTs) of a
larger length. This allows a prime length DFT to be calculated by
a very efficient length-$2^M$ FFT. This becomes practical for large N
when a particular non-composite (or N with few factors) length is
required.
As developed here, the chirp z-transform evaluates the z-transform
at equally spaced points on the unit circle. A slight modification
allows evaluation on a spiral and in segments [298], [273] and
allows savings when only some input values are nonzero or when only
some output values are needed. The story of the development of
this transform is given in [304].
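
The identity (5.16) can be checked numerically with a few lines of code. The Python sketch below uses hypothetical function names and a direct $O(N^2)$ sum in place of the convolution, so it demonstrates the modulate, convolve, demodulate structure rather than a fast implementation. It uses the fact that $W^{-(k-n)^2/2}$ depends only on $|k-n|$ and is the complex conjugate of the chirp.

```python
import cmath

def chirp_dft(x):
    """DFT of arbitrary length N through the chirp identity (5.16)."""
    N = len(x)
    # Chirp h(n) = W^(n^2/2) with W = exp(-j 2 pi / N).
    chirp = [cmath.exp(-1j * cmath.pi * n * n / N) for n in range(N)]
    modulated = [x[n] * chirp[n] for n in range(N)]
    out = []
    for k in range(N):
        # Convolution with the conjugate chirp, W^(-(k-n)^2/2).
        acc = sum(modulated[n] * chirp[abs(k - n)].conjugate() for n in range(N))
        out.append(acc * chirp[k])       # final demodulation by W^(k^2/2)
    return out

def dft_direct(x):
    """Reference DFT from the definition."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

x = [2, -1, 0.5, 3, 1, 4, -2]   # length 7, a prime
print(max(abs(a - b) for a, b in zip(chirp_dft(x), dft_direct(x))))
```

In a practical program the inner sum would be replaced by a zero-padded FFT-based convolution.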

Two Matlab programs to calculate an arbitrary length DFT using
the chirp z-transform are shown in Figure 5.1.

function y = chirpc(x);
% function y = chirpc(x)
% computes an arbitrary-length DFT with the
% chirp z-transform algorithm. csb. 6/12/91
%
N = length(x); n = 0:N-1;            %Sequence length
W = exp(-j*pi*n.*n/N);               %Chirp signal
xw = x.*W;                           %Modulate with chirp
WW = [conj(W(N:-1:2)),conj(W)];      %Construct filter
y = conv(WW,xw);                     %Convolve w filter
y = y(N:2*N-1).*W;                   %Demodulate w chirp

function y = chirp(x);
% function y = chirp(x)
% computes an arbitrary-length Discrete Fourier Transform (DFT)
% with the chirp z transform algorithm. The linear convolution
% then required is done with FFTs.
% 1988: L. Arevalo; 11.06.91 K. Schwarz, LNT Erlangen; 6/12/9
%
N = length(x);                       %Sequence length
L = 2^ceil(log((2*N-1))/log(2));     %FFT length
n = 0:N-1;
W = exp(-j*pi*n.*n/N);               %Chirp signal
FW = fft([conj(W), zeros(1,L-2*N+1), conj(W(N:-1:2))],L);
y = ifft(FW.*fft(x.'.*W,L));         %Convolve using FFT
y = y(1:N).*W;                       %Demodulate

Figure 5.1

5.3 Goertzel's Algorithm (or A Better DFT
Algorithm)

Goertzel's algorithm [144], [62], [269] is another method that
calculates the DFT by converting it into a digital filtering problem.
The method looks at the calculation of the DFT as the evaluation of
a polynomial on the unit circle in the complex plane. This
evaluation is done by Horner's method which is implemented recursively
by an IIR filter.

5.3.1 The First-Order Goertzel Algorithm 

The polynomial whose values on the unit circle are the DFT is a
slightly modified z-transform of x(n) given by

$$X(z) = \sum_{n=0}^{N-1} x(n)\, z^{n} \tag{5.18}$$

which for clarity in this development uses a positive exponent.
This is illustrated for a length-4 sequence as a third-order
polynomial by

$$X(z) = x(3) z^3 + x(2) z^2 + x(1) z + x(0) \tag{5.19}$$

The DFT is found by evaluating (5.18) at $z = W^k$, which can be
written as

$$C(k) = X(z)\big|_{z=W^k} = \mathrm{DFT}\{x(n)\} \tag{5.20}$$

where

$$W = e^{-j2\pi/N} \tag{5.21}$$

The most efficient way of evaluating a general polynomial without
any pre-processing is by "Horner's rule" [208] which is a nested
evaluation. This is illustrated for the polynomial in (5.19) by

$$X(z) = \{[x(3) z + x(2)] z + x(1)\} z + x(0) \tag{5.22}$$

This nested sequence of operations can be written as a linear
difference equation in the form of

$$y(m) = z\, y(m-1) + x(N-m) \tag{5.23}$$

with initial condition $y(0) = 0$, and the desired result being the
solution at $m = N$. The value of the polynomial is given by

$$X(z) = y(N). \tag{5.24}$$

(5.23) can be viewed as a first-order IIR filter with the input being
the data sequence in reverse order and the value of the polynomial
at z being the filter output sampled at $m = N$. Applying this to the
DFT gives the Goertzel algorithm [283], [269] which is

$$y(m) = W^{k}\, y(m-1) + x(N-m) \tag{5.25}$$

with $y(0) = 0$ and

$$C(k) = y(N) \tag{5.26}$$

where

$$C(k) = \sum_{n=0}^{N-1} x(n)\, W^{nk}. \tag{5.27}$$

The flowgraph of the algorithm can be found in [62], [269] and a
simple FORTRAN program is given in the appendix.
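
The first-order recursion (5.25) is short enough to sketch in a few lines. The following Python fragment is an illustration only, with a hypothetical function name, and is not the FORTRAN program referred to above. It feeds the data in reverse order into the first-order filter and reads the output at m = N.

```python
import cmath

def goertzel_first_order(x, k):
    """One DFT value C(k) by the first-order recursion (5.25)."""
    N = len(x)
    w = cmath.exp(-2j * cmath.pi * k / N)   # W^k, evaluated once per frequency
    y = 0j                                   # initial condition y(0) = 0
    for m in range(1, N + 1):
        y = w * y + x[N - m]                 # data enters in reverse order
    return y                                 # C(k) = y(N)

x = [1, 2, 3, 4]
print([goertzel_first_order(x, k) for k in range(4)])
```

Only one complex exponential is evaluated per frequency, which is the saving in trigonometric evaluations discussed in the text.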

When comparing this program with the direct calculation of (5.27),
it is seen that the number of floating-point multiplications and
additions are the same. In fact, the structures of the two algorithms
look similar, but close examination shows that the way the sines
and cosines enter the calculations is different. In (5.27), new sine
and cosine values are calculated for each frequency and for each
data value, while for the Goertzel algorithm in (5.25), they are
calculated only for each frequency in the outer loop. Because of the
recursive or feedback nature of the algorithm, the sine and cosine
values are "updated" each loop rather than recalculated. This
results in 2N trigonometric evaluations rather than $2N^2$. It also
results in an increase in accumulated quantization error.

It is possible to modify this algorithm to allow entering the data
in forward order rather than reverse order. The difference
equation (5.23) becomes

$$y(m) = z^{-1}\, y(m-1) + x(m-1) \tag{5.28}$$

if (5.24) becomes

$$C(k) = z^{N-1}\, y(N) \tag{5.29}$$

for $y(0) = 0$. This is the algorithm programmed later.

5.3.2 The Second-Order Goertzel Algorithm 

One of the reasons the first-order Goertzel algorithm does not
improve efficiency is that the constant in the feedback or recursive
path is complex and, therefore, requires four real multiplications
and two real additions. A modification of the scheme to make it
second-order removes the complex multiplications and reduces the
number of required multiplications by a factor of two.

Define the variable q(m) so that

$$y(m) = q(m) - z^{-1}\, q(m-1). \tag{5.30}$$

This substituted into the right-hand side of (5.23) gives

$$y(m) = z\, q(m-1) - q(m-2) + x(N-m). \tag{5.31}$$

Combining (5.30) and (5.31) gives the second order difference
equation

$$q(m) = (z + z^{-1})\, q(m-1) - q(m-2) + x(N-m) \tag{5.32}$$

which together with the output (5.30), comprise the second-order
Goertzel algorithm where

$$X(z) = y(N) \tag{5.33}$$

for initial conditions $q(0) = q(-1) = 0$.

A similar development starting with (5.28) gives a second-order
algorithm with forward ordered input as

$$q(m) = (z + z^{-1})\, q(m-1) - q(m-2) + x(m-1) \tag{5.34}$$

$$y(m) = q(m) - z\, q(m-1) \tag{5.35}$$

with

$$X(z) = z^{N-1}\, y(N) \tag{5.36}$$

and for $q(0) = q(-1) = 0$.

Note that both difference equations (5.32) and (5.34) are not changed
if z is replaced with $z^{-1}$; only the output equations (5.30) and
(5.35) are different. This means that the polynomial X(z) may be
evaluated at a particular z and its inverse $z^{-1}$ from one solution
of the difference equation (5.32) or (5.34). For the reverse-ordered
recursion (5.32), the output equations are

$$X(z) = q(N) - z^{-1}\, q(N-1) \tag{5.37}$$

and

$$X(1/z) = q(N) - z\, q(N-1). \tag{5.38}$$

For the forward-ordered recursion (5.34), the corresponding outputs
carry a power of z: $X(z) = z^{N-1}(q(N) - z\, q(N-1))$ and
$X(1/z) = z^{-(N-1)}(q(N) - z^{-1} q(N-1))$.

Clearly, this allows the DFT of a sequence to be calculated with
half the arithmetic since the outputs are calculated two at a time.
The second-order DE actually produces a solution q(m) that
contains two first-order components. The output equations are, in
effect, zeros that cancel one or the other pole of the second-order
solution to give the desired first-order solution. In addition to
allowing the calculating of two outputs at a time, the second-order
DE requires half the number of real multiplications as the
first-order form. This is because the coefficient of the $q(m-2)$ is unity
and the coefficient of the $q(m-1)$ is real if z and $z^{-1}$ are complex
conjugates of each other which is true for the DFT.
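
The two-at-a-time property can be sketched for the reverse-ordered recursion (5.32). In the Python sketch below (hypothetical function name, real input data assumed), the recursion uses only the real coefficient $z + z^{-1} = 2\cos(2\pi k/N)$. The two outputs are then formed from q(N) and q(N-1) as $X(z) = q(N) - z^{-1} q(N-1)$ and $X(1/z) = q(N) - z\, q(N-1)$, which follow from (5.30) and from (5.30) with z replaced by $z^{-1}$.

```python
import cmath
import math

def goertzel_pair(x, k):
    """C(k) and C(N-k) from one pass of the second-order recursion (5.32)."""
    N = len(x)
    z = cmath.exp(-2j * cmath.pi * k / N)     # z = W^k, so 1/z is its conjugate
    coeff = 2.0 * math.cos(2.0 * math.pi * k / N)   # z + 1/z, a real number
    q_prev1 = 0.0    # q(m-1)
    q_prev2 = 0.0    # q(m-2)
    for m in range(1, N + 1):
        q_now = coeff * q_prev1 - q_prev2 + x[N - m]   # real arithmetic only
        q_prev2, q_prev1 = q_prev1, q_now
    # After the loop q_prev1 = q(N) and q_prev2 = q(N-1).
    c_k = q_prev1 - z.conjugate() * q_prev2    # X(z)   = C(k)
    c_neg_k = q_prev1 - z * q_prev2            # X(1/z) = C(N-k)
    return c_k, c_neg_k

x = [1, 2, 3, 4, 0, -1, 2.5, 7]
print(goertzel_pair(x, 1))
```

The complex multiplications occur only once, at the output, rather than inside the loop.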

5.3.3 Analysis of Arithmetic Complexity and Tim- 
ings 

Analysis of the various forms of the Goertzel algorithm from their
programs gives the following operation count for real
multiplications and real additions assuming real data.

Algorithm         Real Mults.    Real Adds     Trig Eval.
Direct DFT        4N^2           4N^2          2N^2
First-Order       4N^2           4N^2 - 2N     2N
Second-Order      2N^2 + 2N      4N^2          2N
Second-Order 2    N^2 + N        2N^2 + N      N

Table 5.2



Timings of the algorithms on a PC in milliseconds are given in the
following table.

Algorithm         N = 125    N = 257
Direct DFT        4.90       19.83
First-Order       4.01       16.70
Second-Order      2.64       11.04
Second-Order 2    1.32       5.55

Table 5.3



These timings track the floating point operation counts fairly well. 



5.3.4 Conclusions 

Goertzel's algorithm in its first-order form is not particularly in- 
teresting, but the two-at-a-time second-order form is significantly 
faster than a direct DFT. It can also be used for any polynomial 
evaluation or for the DTFT at unequally spaced values or for eval- 
uating a few DFT terms. A very interesting observation is that the 
inner-most loop of the Glassman-Ferguson FFT [124] is a first- 
order Goertzel algorithm even though that FFT is developed in a 
very different framework. 

In addition to floating-point arithmetic counts, the number of
trigonometric function evaluations that must be made or the size of
a table to store precomputed values should be considered. Since the
values of the W terms in (5.23) are iteratively calculated in the IIR
filter structure, there is round-off error accumulation that should be
analyzed in any application.

It may be possible to further improve the efficiency of the
second-order Goertzel algorithm for calculating all of the DFT of a number
sequence. Perhaps a fourth order DE could calculate four output
values at a time and they could be separated by a numerator that
would cancel three of the zeros. Perhaps the algorithm could be
arranged in stages to give an $N \log(N)$ operation count. The
current algorithm does not take into account any of the symmetries of
the input index. Perhaps some of the ideas used in developing the
QFT [53], [155], [158] could be used here.

5.4 The Quick Fourier Transform (QFT) 

One stage of the QFT can use the symmetries of the sines and
cosines to calculate a DFT more efficiently than directly
implementing the definition Multidimensional Index Mapping:
Equation 1 (3.1). Similar to the Goertzel algorithm, the one-stage
QFT is a better $N^2$ DFT algorithm for arbitrary lengths. See
The Cooley-Tukey Fast Fourier Transform Algorithm: The Quick
Fourier Transform, An FFT based on Symmetries (Section 9.4).

Chapter 6

Factoring the Signal
Processing Operators 1



A third approach to removing redundancy in an algorithm is to
express the algorithm as an operator and then factor that operator into
sparse factors. This approach is used by Tolimieri [382], [384],
Egner [118], Selesnick, Elliott [121] and others. It is presented in a
more general form in DFT and FFT: An Algebraic View
(Chapter 8). The operators may be in the form of a matrix or a tensor
operator.

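1 This content is available online at <http://cnx.org/content/m16330/1.8/>.

6.1 The FFT from Factoring the DFT Operator

The definition of the DFT in Multidimensional Index Mapping:
Equation 1 (3.1) can be written as a matrix-vector operation by
C = WX which, for N = 8, is

$$\begin{bmatrix} C(0) \\ C(1) \\ C(2) \\ C(3) \\ C(4) \\ C(5) \\ C(6) \\ C(7) \end{bmatrix} =
\begin{bmatrix}
W^0 & W^0 & W^0 & W^0 & W^0 & W^0 & W^0 & W^0 \\
W^0 & W^1 & W^2 & W^3 & W^4 & W^5 & W^6 & W^7 \\
W^0 & W^2 & W^4 & W^6 & W^8 & W^{10} & W^{12} & W^{14} \\
W^0 & W^3 & W^6 & W^9 & W^{12} & W^{15} & W^{18} & W^{21} \\
W^0 & W^4 & W^8 & W^{12} & W^{16} & W^{20} & W^{24} & W^{28} \\
W^0 & W^5 & W^{10} & W^{15} & W^{20} & W^{25} & W^{30} & W^{35} \\
W^0 & W^6 & W^{12} & W^{18} & W^{24} & W^{30} & W^{36} & W^{42} \\
W^0 & W^7 & W^{14} & W^{21} & W^{28} & W^{35} & W^{42} & W^{49}
\end{bmatrix}
\begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \end{bmatrix} \tag{6.1}$$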



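which clearly requires $N^2 = 64$ complex multiplications and
$N(N-1)$ additions. A factorization of the DFT operator, W, gives
$W = F_1 F_2 F_3$ and $C = F_1 F_2 F_3 X$ or, expanded,

$$\begin{bmatrix} C(0) \\ C(4) \\ C(2) \\ C(6) \\ C(1) \\ C(5) \\ C(3) \\ C(7) \end{bmatrix} =
\begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & -1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
W^0 & 0 & -W^0 & 0 & 0 & 0 & 0 & 0 \\
0 & W^2 & 0 & -W^2 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & W^0 & 0 & -W^0 & 0 \\
0 & 0 & 0 & 0 & 0 & W^2 & 0 & -W^2
\end{bmatrix} \tag{6.2}$$

$$\times
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
W^0 & 0 & 0 & 0 & -W^0 & 0 & 0 & 0 \\
0 & W^1 & 0 & 0 & 0 & -W^1 & 0 & 0 \\
0 & 0 & W^2 & 0 & 0 & 0 & -W^2 & 0 \\
0 & 0 & 0 & W^3 & 0 & 0 & 0 & -W^3
\end{bmatrix}
\begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \end{bmatrix} \tag{6.3}$$

where the $F_i$ matrices are sparse. Note that each has 16 (or 2N)
non-zero terms and $F_2$ and $F_3$ have 8 (or N) non-unity terms. If
$N = 2^M$, then the number of factors is $\log_2(N) = M$. In another
form with the twiddle factors separated so as to count the complex
multiplications we have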


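$$\begin{bmatrix} C(0) \\ C(4) \\ C(2) \\ C(6) \\ C(1) \\ C(5) \\ C(3) \\ C(7) \end{bmatrix} =
\begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & -1
\end{bmatrix} \tag{6.4}$$

$$\times\ \mathrm{diag}\left(1, 1, W^0, W^2, 1, 1, W^0, W^2\right)
\begin{bmatrix}
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & -1
\end{bmatrix} \tag{6.5}$$

$$\times\ \mathrm{diag}\left(1, 1, 1, 1, W^0, W^1, W^2, W^3\right)
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & -1
\end{bmatrix}
\begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \end{bmatrix} \tag{6.6}$$

which is in the form $C = A_1 M_1 A_2 M_2 A_3 X$ described by the
index map. $A_1$, $A_2$, and $A_3$ each represents 8 additions, or, in general,
N additions. $M_1$ and $M_2$ each represent 4 (or N/2) multiplications.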

This is a very interesting result showing that implementing the 
DFT using the factored form requires considerably less arithmetic 
than the single factor definition. Indeed, the form of the formula 
that Cooley and Tukey derived showing that the amount of arith- 
metic required by the FFT is on the order of Nlog (N) can be seen 
from the factored operator formulation. 

Much of the theory of the FFT can be developed using operator 
factoring and it has some advantages for implementation of parallel 
and vector computer architectures. The eigenspace approach is 
somewhat of the same type [18]. 
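
The length-8 factorization can be verified numerically. The Python sketch below is illustrative; the helper names `stage` and `matmul` are hypothetical. It builds the three sparse DIF butterfly matrices for block lengths 2, 4, and 8, multiplies them, and compares each row of the product with the corresponding row of the DFT matrix taken in bit-reversed order.

```python
import cmath

N = 8
W = cmath.exp(-2j * cmath.pi / N)

def matmul(A, B):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def stage(block):
    """Sparse DIF butterfly matrix acting on consecutive blocks of this length."""
    F = [[0j] * N for _ in range(N)]
    half = block // 2
    for start in range(0, N, block):
        for n in range(half):
            twiddle = W ** (n * N // block)      # W^0..W^3, W^0 and W^2, or W^0
            top, bottom = start + n, start + n + half
            F[top][top] = 1
            F[top][bottom] = 1
            F[bottom][top] = twiddle
            F[bottom][bottom] = -twiddle
    return F

F1, F2, F3 = stage(2), stage(4), stage(8)
product = matmul(F1, matmul(F2, F3))

bit_reversed = [0, 4, 2, 6, 1, 5, 3, 7]          # output order of C(k)
error = max(abs(product[i][n] - W ** (bit_reversed[i] * n))
            for i in range(N) for n in range(N))
nonzeros = [sum(1 for row in F for v in row if v != 0) for F in (F1, F2, F3)]
print(error, nonzeros)
```

Each factor has 2N nonzero entries, so applying the three factors in sequence costs on the order of $N \log_2(N)$ operations instead of $N^2$.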

6.2 Algebraic Theory of Signal Processing 
Algorithms 

A very general structure for all kinds of algorithms can be
generalized from the approach of operators and operator decomposition.
This is developed as "Algebraic Theory of Signal Processing"
discussed in the module DFT and FFT: An Algebraic View
(Chapter 8) by Püschel and others [118].






Chapter 7 

Winograd's Short DFT 
Algorithms 1 



In 1976, S. Winograd [406] presented a new DFT algorithm which 
had significantly fewer multiplications than the Cooley-Tukey FFT 
which had been published eleven years earlier. This new Wino- 
grad Fourier Transform Algorithm (WFTA) is based on the type- 
one index map from Multidimensional Index Mapping (Chapter 3) 
with each of the relatively prime length short DFT's calculated by 
very efficient special algorithms. It is these short algorithms that 
this section will develop. They use the index permutation of Rader
described in another module to convert the prime length short
DFTs into cyclic convolutions. Winograd developed a method for
calculating digital convolution with the minimum number of mul- 
tiplications. These optimal algorithms are based on the polynomial 
residue reduction techniques of Polynomial Description of Signals: 
Equation 1 (4.1) to break the convolution into multiple small ones 
[29], [235], [263], [416], [408], [197]. 



1 This content is available online at <http://cnx.org/content/m16333/1.14/>.

The operation of discrete convolution defined by

$$ y(n) = \sum_{k} h(n-k)\, x(k) \tag{7.1} $$

is called a bilinear operation because, for a fixed h(n),y (n) is a 
linear function of x (n) and for a fixed x (n) it is a linear function of 
h (n). The operation of cyclic convolution is the same but with all 
indices evaluated modulo N. 

Recall from Polynomial Description of Signals: Equation 3 (4.3) 
that length-N cyclic convolution of x (n) and h (n) can be repre- 
sented by polynomial multiplication 

$$ Y(s) = X(s)\, H(s) \bmod \left(s^N - 1\right) \tag{7.2} $$
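
As a small illustration of the equivalence between (7.1) with indices taken modulo N and (7.2), the sketch below computes a length-4 cyclic convolution both ways. The function names are hypothetical, and the data values are arbitrary test numbers, not taken from the text.

```python
def cyclic_convolution(x, h):
    """y(n) = sum_k h((n - k) mod N) x(k), with all indices modulo N."""
    n = len(x)
    return [sum(h[(i - k) % n] * x[k] for k in range(n)) for i in range(n)]

def poly_mul_mod_sN_minus_1(x, h):
    """Multiply X(s) H(s), then reduce modulo s^N - 1 by using s^N = 1."""
    n = len(x)
    product = [0] * (2 * n - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            product[i + j] += xi * hj
    y = [0] * n
    for power, coeff in enumerate(product):
        y[power % n] += coeff      # s^(N + r) folds back onto s^r
    return y

x = [1, 2, 3, 4]
h = [5, 6, 7, 8]
y_direct = cyclic_convolution(x, h)
y_poly = poly_mul_mod_sN_minus_1(x, h)
print(y_direct, y_poly)
```

Both routes use $N^2$ multiplications; the rest of this chapter is about reducing that count.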

This bilinear operation of (7.1) and (7.2) can also be expressed 
in terms of linear matrix operators and a simpler bilinear opera- 
tor denoted by o which may be only a simple element-by-element 
multiplication of the two vectors [235], [197], [212]. This matrix 
formulation is 

$$ Y = C\,[AX \circ BH] \tag{7.3} $$

where X, H and Y are length-N vectors with elements of x(n),
h(n) and y(n) respectively. The matrices A and B have dimension
M x N, and C is N x M with $M \ge N$. The elements of A, B, and
C are constrained to be simple; typically small integers or rational
numbers. It will be these matrix operators that do the equivalent of
the residue reduction on the polynomials in (7.2).

In order to derive a useful algorithm of the form (7.3) to calculate 
(7.1), consider the polynomial formulation (7.2) again. To use the 
residue reduction scheme, the modulus is factored into relatively 
prime factors. Fortunately the factoring of this particular
polynomial, $s^N - 1$, has been extensively studied and it has
considerable structure. When factored over the rationals, which means that
the only coefficients allowed are rational numbers, the factors are
called cyclotomic polynomials [29], [235], [263]. The most inter- 
esting property for our purposes is that most of the coefficients of 
cyclotomic polynomials are zero and the others are plus or minus 
unity for degrees up to over one hundred. This means the residue 
reduction will generally require no multiplications. 
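
The claim about the coefficients can be explored with a short sketch. It assumes the standard identity that $s^n - 1$ is the product of the cyclotomic polynomials $\Phi_d(s)$ over all divisors d of n, and it stores polynomials as integer lists with the lowest power first. The function names are hypothetical. The first index at which a coefficient other than 0 or plus or minus 1 appears is known to be n = 105.

```python
def poly_divide_exact(num, den):
    """Divide integer polynomials (lowest power first); den is monic and divides num."""
    num = list(num)
    quotient = [0] * (len(num) - len(den) + 1)
    for i in range(len(quotient) - 1, -1, -1):
        q = num[i + len(den) - 1]          # den is monic, so no division is needed
        quotient[i] = q
        for j, d in enumerate(den):
            num[i + j] -= q * d
    assert not any(num), "division left a remainder"
    return quotient

def cyclotomic_polynomials(max_n):
    """Return {n: coefficients of Phi_n(s)} from s^n - 1 = product of Phi_d(s), d | n."""
    phi = {}
    for n in range(1, max_n + 1):
        poly = [-1] + [0] * (n - 1) + [1]   # s^n - 1
        for d in range(1, n):
            if n % d == 0:
                poly = poly_divide_exact(poly, phi[d])
        phi[n] = poly
    return phi

phi = cyclotomic_polynomials(105)
factors_of_s4_minus_1 = [phi[d] for d in (1, 2, 4)]   # (s - 1), (s + 1), (s^2 + 1)
simple_up_to_104 = all(abs(c) <= 1 for n in range(1, 105) for c in phi[n])
largest_in_105 = max(abs(c) for c in phi[105])
print(factors_of_s4_minus_1, simple_up_to_104, largest_in_105)
```

For the lengths of interest in short DFT design, every factor therefore leads to a reduction that needs only additions and subtractions.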

The operations of reducing X (s) and H (s) in (7.2) are carried out 
by the matrices A and B in (7.3). The convolution of the residue 
polynomials is carried out by the o operator and the recombination 
by the CRT is done by the C matrix. More details are in [29], [235], 
[263], [197], [212] but the important fact is the A and B matrices 
usually contain only zero and plus or minus unity entries and the 
C matrix only contains rational numbers. The only general mul- 
tiplications are those represented by o. Indeed, in the theoretical 
results from computational complexity theory, these real or com- 
plex multiplications are usually the only ones counted. In practical 
algorithms, the rational multiplications represented by C could be 
a limiting factor. 

The h(n) terms are fixed for a digital filter, or they represent the W
terms from Multidimensional Index Mapping: Equation 1 (3.1) if
the convolution is being used to calculate a DFT. Because of this,
d = BH in (7.3) can be precalculated and only the A and C operators
represent the mathematics done at execution of the algorithm.
In order to exploit this feature, it was shown [416], [197] that the
properties of (7.3) allow the exchange of the more complicated
operator C with the simpler operator B. Specifically this is given by

$$ Y = C\,[AX \circ BH] \tag{7.4} $$

$$ Y' = B^T \left[ AX \circ C^T H' \right] \tag{7.5} $$



where H' has the same elements as H, but in a permuted order,
and likewise Y' and Y. This very important property allows
precomputing the more complicated $C^T H'$ in (7.5) rather than BH as
in (7.3).

Because BH or $C^T H'$ can be precomputed, the bilinear form of
(7.3) and (7.5) can be written as a linear form. If an M x M diagonal
matrix D is formed from $d = C^T H'$, or in the case of (7.3), d = BH,
assuming a commutative property for o, (7.5) becomes



$$ Y' = B^T D A X \tag{7.6} $$

and (7.3) becomes

$$ Y = C D A X \tag{7.7} $$

In most cases there is no reason not to use the same reduction
operations on X and H, therefore, B can be the same as A and (7.6)
then becomes

$$ Y' = A^T D A X \tag{7.8} $$

In order to illustrate how the residue reduction is carried out and 
how the A matrix is obtained, the length-5 DFT algorithm started 
in The DFT as Convolution or Filtering: Matrix 1 (5.12) will be 
continued. The DFT is first converted to a length-4 cyclic convo- 
lution by the index permutation from The DFT as Convolution or 
Filtering: Equation 3 (5.3) to give the cyclic convolution in The 
DFT as Convolution or Filtering (Chapter 5). To avoid confusion 
from the permuted order of the data x (n) in The DFT as Convo- 
lution or Filtering (Chapter 5), the cyclic convolution will first be 
developed without the permutation, using the polynomial U (s) 

$$ U(s) = x(1) + x(3)\, s + x(4)\, s^2 + x(2)\, s^3 \tag{7.9} $$

$$ U(s) = u(0) + u(1)\, s + u(2)\, s^2 + u(3)\, s^3 \tag{7.10} $$

and then the results will be converted back to the permuted x(n).
The length-4 cyclic convolution in terms of polynomials is

$$ Y(s) = U(s)\, H(s) \bmod \left(s^4 - 1\right) \tag{7.11} $$

and the modulus factors into three cyclotomic polynomials

$$ s^4 - 1 = \left(s^2 - 1\right)\left(s^2 + 1\right) \tag{7.12} $$

$$ = (s - 1)(s + 1)\left(s^2 + 1\right) \tag{7.13} $$

$$ = P_1 P_2 P_3 \tag{7.14} $$

Both U(s) and H(s) are reduced modulo these three polynomials.
The reduction modulo $P_1$ and $P_2$ is done in two stages. First it is
done modulo $(s^2 - 1)$, then that residue is further reduced modulo
$(s - 1)$ and $(s + 1)$.

$$ U(s) = u_0 + u_1 s + u_2 s^2 + u_3 s^3 \tag{7.15} $$

$$ U'(s) = ((U(s)))_{(s^2 - 1)} = (u_0 + u_2) + (u_1 + u_3)\, s \tag{7.16} $$

$$ U_1(s) = ((U'(s)))_{P_1} = (u_0 + u_1 + u_2 + u_3) \tag{7.17} $$

$$ U_2(s) = ((U'(s)))_{P_2} = (u_0 - u_1 + u_2 - u_3) \tag{7.18} $$

$$ U_3(s) = ((U(s)))_{P_3} = (u_0 - u_2) + (u_1 - u_3)\, s \tag{7.19} $$

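The reduction in (7.16) of the data polynomial (7.15) can be
denoted by a matrix operation on a vector which has the data as entries

$$
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix}
=
\begin{bmatrix} u_0 + u_2 \\ u_1 + u_3 \end{bmatrix}
\tag{7.20}
$$

and the reduction in (7.19) is

$$
\begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix}
=
\begin{bmatrix} u_0 - u_2 \\ u_1 - u_3 \end{bmatrix}
\tag{7.21}
$$

Combining (7.20) and (7.21) gives one operator

$$
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix}
=
\begin{bmatrix} u_0 + u_2 \\ u_1 + u_3 \\ u_0 - u_2 \\ u_1 - u_3 \end{bmatrix}
=
\begin{bmatrix} w_0 \\ w_1 \\ v_0 \\ v_1 \end{bmatrix}
\tag{7.22}
$$

Further reduction of $v_0 + v_1 s$ is not possible because $P_3 = s^2 + 1$
cannot be factored over the rationals. However $s^2 - 1$ can be
factored into $P_1 P_2 = (s - 1)(s + 1)$ and, therefore, $w_0 + w_1 s$ can be
further reduced as was done in (7.17) and (7.18) by

$$
\begin{bmatrix} 1 & 1 \end{bmatrix}
\begin{bmatrix} w_0 \\ w_1 \end{bmatrix}
= w_0 + w_1 = u_0 + u_2 + u_1 + u_3
\tag{7.23}
$$

$$
\begin{bmatrix} 1 & -1 \end{bmatrix}
\begin{bmatrix} w_0 \\ w_1 \end{bmatrix}
= w_0 - w_1 = u_0 + u_2 - u_1 - u_3
\tag{7.24}
$$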



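Combining (7.22), (7.23) and (7.24) gives

$$
\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix}
=
\begin{bmatrix} r_1 \\ r_2 \\ v_0 \\ v_1 \end{bmatrix}
\tag{7.25}
$$

The same reduction is done to H(s) and then the convolution of
(7.11) is done by multiplying each residue polynomial of X(s) and
H(s) modulo each corresponding cyclotomic factor of P(s) and
finally a recombination using the polynomial Chinese Remainder
Theorem (CRT) as in Polynomial Description of Signals: Equation
9 (4.9) and Polynomial Description of Signals: Equation 13 (4.13).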



$$
Y(s) = K_1(s)\, U_1(s)\, H_1(s) + K_2(s)\, U_2(s)\, H_2(s) + K_3(s)\, U_3(s)\, H_3(s)
\bmod \left(s^4 - 1\right)
\tag{7.26}
$$

where $U_1(s) = r_1$ and $U_2(s) = r_2$ are constants and
$U_3(s) = v_0 + v_1 s$ is a first degree polynomial. $U_1$ times $H_1$ and
$U_2$ times $H_2$ are easy, but multiplying $U_3$ times $H_3$ modulo
$(s^2 + 1)$ is more difficult.

The multiplication of $U_3(s)$ times $H_3(s)$ can be done by the
Toom-Cook algorithm [29], [235], [263] which can be viewed as
Lagrange interpolation or polynomial multiplication modulo a
special polynomial with three arbitrary coefficients. To simplify the
arithmetic, the constants are chosen to be plus and minus one and
zero. The details of this can be found in [29], [235], [263]. For this
example it can be verified that



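$$
((\,(v_0 + v_1 s)(h_0 + h_1 s)\,))_{s^2 + 1} = (v_0 h_0 - v_1 h_1) + (v_0 h_1 + v_1 h_0)\, s
\tag{7.27}
$$

which by the Toom-Cook algorithm or inspection is

$$
\begin{bmatrix} 1 & -1 & 0 \\ -1 & -1 & 1 \end{bmatrix}
\left(
\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}
\begin{bmatrix} v_0 \\ v_1 \end{bmatrix}
\circ
\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}
\begin{bmatrix} h_0 \\ h_1 \end{bmatrix}
\right)
=
\begin{bmatrix} y_0 \\ y_1 \end{bmatrix}
\tag{7.28}
$$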



where o signifies point-by-point multiplication. The total A matrix
in (7.3) is a combination of (7.25) and (7.28) giving

$$ AX = A_1 A_2 A_3 X \tag{7.29} $$

$$
=
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix}
\tag{7.30}
$$

where the matrix $A_3$ gives the residue reduction modulo $s^2 - 1$ and
$s^2 + 1$, the upper left-hand part of $A_2$ gives the reduction modulo
$s - 1$ and $s + 1$, and the lower right-hand part of $A_1$ carries out the
Toom-Cook algorithm modulo $s^2 + 1$ with the multiplication in (7.5).
Notice that by calculating (7.30) in the three stages, seven additions
are required. Also notice that $A_1$ is not square. It is this
"expansion" that causes more than N multiplications to be required in
o in (7.5) or D in (7.6). This staged reduction will derive the A
operator for (7.5).
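
The complete length-4 procedure (staged reduction, residue products, and CRT recombination) can be sketched as follows. This is an illustration under stated assumptions, not the program from the references: the function names are hypothetical, and the CRT polynomials $K_1 = (1 + s + s^2 + s^3)/4$, $K_2 = (1 - s + s^2 - s^3)/4$ and $K_3 = (1 - s^2)/2$ were worked out for this example from the requirement that each $K_i$ is congruent to 1 modulo $P_i$ and to 0 modulo the other two factors.

```python
from fractions import Fraction

def reduce_length4(u):
    """Residues of u0 + u1 s + u2 s^2 + u3 s^3 modulo (s - 1), (s + 1), (s^2 + 1)."""
    w0, w1 = u[0] + u[2], u[1] + u[3]      # modulo s^2 - 1, as in (7.16)
    v0, v1 = u[0] - u[2], u[1] - u[3]      # modulo s^2 + 1, as in (7.19)
    return w0 + w1, w0 - w1, v0, v1        # r1 and r2 as in (7.17), (7.18)

def cyclic_convolution4(u, h):
    r1, r2, v0, v1 = reduce_length4(u)
    g1, g2, h0, h1 = reduce_length4(h)
    # Five general multiplications: two for the constant residues,
    # three for the first degree product modulo s^2 + 1.
    y1 = r1 * g1
    y2 = r2 * g2
    m0, m1, m2 = v0 * h0, v1 * h1, (v0 + v1) * (h0 + h1)
    a = m0 - m1                 # constant term of U3 H3 mod (s^2 + 1)
    b = m2 - m0 - m1            # coefficient of s
    # CRT recombination using the assumed K1, K2, K3 given in the lead-in.
    q, half = Fraction(1, 4), Fraction(1, 2)
    return [q * (y1 + y2) + half * a,
            q * (y1 - y2) + half * b,
            q * (y1 + y2) - half * a,
            q * (y1 - y2) - half * b]

def direct_cyclic_convolution(u, h):
    """Reference: y(n) = sum_k h((n - k) mod N) u(k)."""
    n = len(u)
    return [sum(h[(i - k) % n] * u[k] for k in range(n)) for i in range(n)]

u = [1, 2, 3, 4]
h = [5, 6, 7, 8]
y_fast = cyclic_convolution4(u, h)
y_ref = direct_cyclic_convolution(u, h)
print(y_fast, y_ref)
```

The reduction of `u` uses six additions and the Toom-Cook sum `v0 + v1` a seventh, which agrees with the count of seven additions for the three-stage A operator. When `h` is fixed, everything computed from `h` can be done ahead of time.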

The method described above is very straightforward for the
shorter DFT lengths. For N = 3, both of the residue polynomials
are constants and the multiplication given by o in (7.3) is trivial.
For N = 5, which is the example used here, there is one first degree
polynomial multiplication required but the Toom-Cook algorithm
uses simple constants and, therefore, works well as indicated in
(7.28). For N = 7, there are two first degree residue polynomials
which can each be multiplied by the same techniques used in the
N = 5 example. Unfortunately, for any longer lengths, the residue
polynomials have an order of three or greater which causes the
Toom-Cook algorithm to require constants of plus and minus two
and worse. For that reason, the Toom-Cook method is not used,
and other techniques such as index mapping are used that require
more than the minimum number of multiplications, but do not
require an excessive number of additions. The resulting algorithms
still have the structure of (7.8). Blahut [29] and Nussbaumer [263]
have a good collection of algorithms for polynomial multiplication
that can be used with the techniques discussed here to construct a
wide variety of DFT algorithms.

The constants in the diagonal matrix D can be found from the CRT
matrix C in (7.5) using $d = C^T H'$ for the diagonal terms in D.
As mentioned above, for the smaller prime lengths of 3, 5, and 
7 this works well but for longer lengths the CRT becomes very 
complicated. An alternate method for finding D uses the fact that 
since the linear form (7.6) or (7.8) calculates the DFT, it is possible 
to calculate a known DFT of a given x (n) from the definition of 
the DFT in Multidimensional Index Mapping: Equation 1 (3.1) 
and, given the A matrix in (7.8), solve for D by solving a set of 
simultaneous equations. The details of this procedure are described 
in [197]. 

A modification of this approach also works for a length which is
an odd prime raised to some power: $N = P^M$. This is a bit more
complicated [235], [416] but has been done for lengths of 9 and
25. For longer lengths, the conventional Cooley-Tukey type-two
index map algorithm seems to be more efficient. For powers of
two, there is no primitive root, and therefore, no simple conversion
of the DFT into convolution. It is possible to use two generators
[235], [263], [408] to make the conversion and there exists a set of
length 4, 8, and 16 DFT algorithms of the form in (7.8) in [235].

In Table 7.1 an operation count of several short DFT algorithms is
presented. These are practical algorithms that can be used alone
or in conjunction with the index mapping to give longer DFTs as
shown in The Prime Factor and Winograd Fourier Transform
Algorithms (Chapter 10). Most are optimized in having either the
theoretical minimum number of multiplications or the minimum
number of multiplications without requiring a very large number of
additions. Some allow other reasonable trade-offs between
numbers of multiplications and additions. There are two lists of the
number of multiplications. The first is the number of actual
floating point multiplications that must be done for that length DFT.
Some of these (one or two in most cases) will be by rational
constants and the others will be by irrational constants. The second
list is the total number of multiplications given in the diagonal
matrix D in (7.8). At least one of these will be unity (the one
associated with X(0)) and in some cases several will be unity (for
$N = 2^M$). The second list is important in programming the WFTA
in The Prime Factor and Winograd Fourier Transform Algorithm:
The Winograd Fourier Transform Algorithm (Section 10.2: The
Winograd Fourier Transform Algorithm).

Table 7.1: Number of Real Multiplications and Additions for a
Length-N DFT of Complex Data

Because of the structure of the short DFTs, the number of real
multiplications required for the DFT of real data is exactly half that
required for complex data. The number of real additions required
is slightly less than half that required for complex data because
(N - 1) of the additions needed when N is prime add a real to an
imaginary, and that is not actually performed. When $N = 2^m$, there
are (N - 2) of these pseudo additions. The special case for real data
is discussed in [101], [177], [356].






The structure of these algorithms is in the form of $X' = CDAX$ or
$B^T DAX$ or $A^T DAX$ from (7.5) and (7.8). The A and B matrices are
generally M by N with $M \ge N$ and have elements that are integers,
generally 0 or ±1. A pictorial description is given in Figure 7.1.




Figure 7.1: Flow Graph for the Length-5 DFT 




Figure 7.2: Block Diagram of a Winograd Short DFT

The flow graph in Figure 7.1 should be compared with the
matrix description of (7.8) and (7.30), and with the programs in [29],
[235], [63], [263] and the appendices. The shape in Figure 7.2
illustrates the expansion of the data by A. That is to say, AX has
more entries than X because M > N. The A operator consists of
additions, the D operator gives the M multiplications (some by one)
and $A^T$ contracts the data back to N values with additions only. M
is one half the second list of multiplies in Table 7.1.

An important characteristic of the D operator in the calculation
of the DFT is that its entries are either purely real or imaginary. The
reduction of the W vector by $\left(s^{(N-1)/2} - 1\right)$ and
$\left(s^{(N-1)/2} + 1\right)$
separates the real and the imaginary constants. This is discussed
in [416], [197]. The number of multiplications for complex data is
only twice those necessary for real data, not four times.

Although this discussion has been on the calculation of the DFT, 
very similar results are true for the calculation of convolution and 
correlation, and these will be further developed in Algorithms for 
Data with Restrictions (Chapter 12). The $A^T D A$ structure and the
picture in Figure 7.2 are the same for convolution. Algorithms and 
operation counts can be found in [29], [263], [7]. 



7.1 The Bilinear Structure 

The bilinear form introduced in (7.3) and the related linear form in 
(7.6) are very powerful descriptions of both the DFT and convolu- 
tion. 

Bilinear: $$ Y = C\,[AX \circ BH] \tag{7.31} $$

Linear: $$ Y = C D A X \tag{7.32} $$

Since (7.31) is a bilinear operation defined in terms of a second
bilinear operator o, this formulation can be nested. For example if
o is itself defined in terms of a second bilinear operator @, by

$$ X \circ H = C' \left[ A' X \mathbin{@} B' H \right] \tag{7.33} $$

then (7.31) becomes

$$ Y = C C' \left[ A' A X \mathbin{@} B' B H \right] \tag{7.34} $$



For convolution, if A represents the polynomial residue reduction
modulo the cyclotomic polynomials, then A is square (e.g. (7.25))
and o represents multiplication of the residue polynomials modulo
the cyclotomic polynomials. If A represents the reduction modulo
the cyclotomic polynomials plus the Toom-Cook reduction as was
the case in the example of (7.30), then A is M x N and o is
term-by-term simple scalar multiplication. In this case AX can be thought
of as a transform of X and C is the inverse transform. This is called
a rectangular transform [7] because A is rectangular. The transform
requires only additions and convolution is done with M
multiplications. The other extreme is when A represents reduction over the N
complex roots of $s^N - 1$. In this case A is the DFT itself, as in the
example of (43), and o is point-by-point complex multiplication
and C is the inverse DFT. A trivial case is where A, B and C are
identity operators and o is the cyclic convolution.
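
The DFT extreme of the bilinear form can be illustrated with a short sketch in which A and B are the DFT, o is point-by-point complex multiplication, and C is the inverse DFT. The function names are hypothetical and the DFT is computed directly from its definition, since only the structure is of interest here.

```python
import cmath

def dft(x, sign=-1):
    """Direct DFT (sign=-1) or unscaled inverse DFT (sign=+1)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def convolve_via_dft(x, h):
    ax = dft(x)                                   # A X
    bh = dft(h)                                   # B H, with B = A
    products = [p * q for p, q in zip(ax, bh)]    # o: N complex multiplications
    n = len(x)
    return [value / n for value in dft(products, sign=+1)]   # C: inverse DFT

x = [1, 2, 3, 4]
h = [5, 6, 7, 8]
y = convolve_via_dft(x, h)
print([round(value.real, 6) for value in y])
```

Here M = N, so there is no expansion of the data, but the entries of A, B and C are complex rather than small integers or rationals.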

This very general and flexible bilinear formulation coupled with 
the idea of nesting in (7.34) gives a description of most forms of 
convolution. 



7.2 Winograd's Complexity Theorems 

Because Winograd's work [29], [235], [416], [408], [413], [419]
has been the foundation of the modern results in efficient
convolution and DFT algorithms, it is worthwhile to look at his theoretical
conclusions on optimal algorithms. Most of his results are stated
in terms of polynomial multiplication as Polynomial Description
of Signals: Equation 3 (4.3) or (7.11). The measure of
computational complexity is usually the number of multiplications, and
only certain multiplications are counted. This must be understood
in order not to misinterpret the results.

This section will simply give a statement of the pertinent results 
and will not attempt to derive or prove anything. A short interpre- 
tation of each theorem will be given to relate the result to the algo- 
rithms developed in this chapter. The indicated references should 
be consulted for background and detail. 

Theorem 1 [416] Given two polynomials, x (s) and h (s), of degree 
N and M respectively, each with indeterminate coefficients that are 
elements of a field H, N + M + 1 multiplications are necessary to 
compute the coefficients of the product polynomial x (s) h (s) . Mul- 
tiplication by elements of the field G (the field of constants), which 
is contained in H, are not counted and G contains at least N + M 
distinct elements. 

The upper bound in this theorem can be realized by choosing an 
arbitrary modulus polynomial P (s) of degree N + M + 1 composed 
of N + M + 1 distinct linear polynomial factors with coefficients in 
G which, since its degree is greater than the product x (s) h (s), has 
no effect on the product, and by reducing x (s) and h(s) to N+M + 
1 residues modulo the N + M + 1 factors of P(s). These residues 
are multiplied by each other, requiring N + M + 1 multiplications, 
and the results recombined using the Chinese remainder theorem 
(CRT). The operations required in the reduction and recombination 
are not counted, while the residue multiplications are. Since the 
modulus P (s) is arbitrary, its factors are chosen to be simple so as 
to make the reduction and CRT simple. Factors of zero, plus and 
minus unity, and infinity are the simplest. Plus and minus two and 
other factors complicate the actual calculations considerably, but 
the theorem does not take that into account. This algorithm is a
form of the Toom-Cook algorithm and of Lagrange interpolation
[29], [235], [263], [416]. For our applications, H is the field of 
reals and G the field of rationals. 
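
The construction behind the upper bound can be sketched as follows. This is an illustration only, with hypothetical function names: it evaluates both polynomials at N + M + 1 distinct integer points (0, 1, -1, 2, ...), forms one counted product per point, and recovers the coefficients by Lagrange interpolation in exact rational arithmetic. It uses only finite points, whereas the text notes that infinity can also serve as a simple "factor".

```python
from fractions import Fraction

def evaluate(poly, point):
    """Horner evaluation; coefficients are listed lowest power first."""
    result = 0
    for coeff in reversed(poly):
        result = result * point + coeff
    return result

def toom_cook_multiply(x, h):
    """Product of x(s) and h(s) from len(x) + len(h) - 1 pointwise products."""
    count = len(x) + len(h) - 1            # N + M + 1 for degrees N and M
    points = [0]
    k = 1
    while len(points) < count:             # 0, 1, -1, 2, -2, ...
        points.append(k)
        if len(points) < count:
            points.append(-k)
        k += 1
    # The only counted multiplications: one per evaluation point.
    products = [evaluate(x, g) * evaluate(h, g) for g in points]
    # Lagrange interpolation recovers the coefficients of the product.
    coeffs = [Fraction(0)] * count
    for i, gi in enumerate(points):
        basis = [Fraction(1)]              # running product of (s - gj), j != i
        denom = Fraction(1)
        for j, gj in enumerate(points):
            if j != i:
                basis = [Fraction(0)] + basis          # multiply by s
                for t in range(len(basis) - 1):
                    basis[t] -= gj * basis[t + 1]      # subtract gj times old basis
                denom *= gi - gj
        for t in range(count):
            coeffs[t] += products[i] * basis[t] / denom
    return coeffs

product = toom_cook_multiply([1, 2, 3], [4, 5])   # (1 + 2s + 3s^2)(4 + 5s)
print([int(c) for c in product])
```

The evaluation and interpolation steps multiply only by constants from G, which is why the theorem does not count them. With points of larger magnitude these constants grow quickly, which is the practical difficulty mentioned above.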

Theorem 2 [416] If an algorithm exists which computes x(s) h (s) 
in N + M + 1 multiplications, all but one of its multiplication steps 
must necessarily be of the form 

$$ m_k = \left(g'_k + x(g_k)\right)\left(g''_k + h(g_k)\right) \quad \text{for } k = 0, 1, \ldots, N + M \tag{7.35} $$

where $g_k$ are distinct elements of G; and $g'_k$ and $g''_k$ are arbitrary
elements of G.

This theorem states that the structure of an optimal algorithm is 
essentially unique although the factors of P (s) may be chosen ar- 
bitrarily. 

Theorem 3 [416] Let P(s) be a polynomial of degree N and be of
the form $P(s) = Q(s)^k$, where Q(s) is an irreducible polynomial
with coefficients in G and k is a positive integer. Let x(s) and h(s)
be two polynomials of degree at least N - 1 with coefficients from
H, then 2N - 1 multiplications are required to compute the product
x(s) h(s) modulo P(s).

This theorem is similar to Theorem 1 (p. 72) with the operations 
of the reduction of the product modulo P(s) not being counted. 

Theorem 4 [416] Any algorithm that computes the product
x(s) h(s) modulo P(s) according to the conditions stated in
Theorem 3 and requires 2N - 1 multiplications will necessarily be of
one of three structures, each of which has the form of Theorem 2
internally.

As in Theorem 2 (p. 73), this theorem states that only a limited
number of possible structures exist for optimal algorithms.

Theorem 5 [416] If the modulus polynomial P(s) has degree N
and is not irreducible, it can be written in a unique factored form
$P(s) = P_1^{m_1}(s)\, P_2^{m_2}(s) \cdots P_k^{m_k}(s)$ where each of the
$P_i(s)$ are irreducible over the allowed coefficient field G. 2N - k
multiplications are necessary to compute the product x(s) h(s) modulo P(s)
where x(s) and h(s) have coefficients in H and are of degree at
least N - 1. All algorithms that calculate this product in 2N - k
multiplications must be of a form where each of the k residue
polynomials of x(s) and h(s) are separately multiplied modulo the
factors of P(s) via the CRT.

Corollary: If the modulus polynomial is $P(s) = s^N - 1$, then
2N - t(N) multiplications are necessary to compute x(s) h(s) modulo
P(s), where t(N) is the number of positive divisors of N.

Theorem 5 (p. 73) is very general since it allows a general
modulus polynomial. The proof of the upper bound involves reducing
x(s) and h(s) modulo the k factors of P(s). Each of the k
irreducible residue polynomials is then multiplied using the method
of Theorem 4 (p. 73) requiring $2N_i - 1$ multiplies and the
products are combined using the CRT. The total number of multiplies
from the k parts is 2N - k. The theorem also states the structure
of these optimal algorithms is essentially unique. The special case
of $P(s) = s^N - 1$ is interesting since it corresponds to cyclic
convolution and, as stated in the corollary, k is easily determined. The
factors of $s^N - 1$ are called cyclotomic polynomials and have
interesting properties [29], [235], [263].

Theorem 6 [416], [408] Consider calculating the DFT of a prime
length real-valued number sequence. If G is chosen as the field
of rational numbers, the number of real multiplications necessary
to calculate a length-P DFT is $\mu(DFT(N)) = 2P - 3 - t(P - 1)$
where t(P - 1) is the number of divisors of P - 1.

This theorem not only gives a lower limit on any practical prime
length DFT algorithm, it also gives practical algorithms for
N = 3, 5, and 7. Consider the operation counts in Table 7.1 to
understand this theorem. In addition to the real multiplications counted
by complexity theory, each optimal prime-length algorithm will
have one multiplication by a rational constant. That constant
corresponds to the residue modulo (s - 1) which always exists for the
modulus $P(s) = s^{N-1} - 1$. In a practical algorithm, this
multiplication must be carried out, and that accounts for the difference
in the prediction of Theorem 6 (p. 74) and count in Table 7.1.
In addition, there is another operation that for certain applications
must be counted as a multiplication. That is the calculation of the
zero frequency term X(0) in the first row of the example in The
DFT as Convolution or Filtering: Matrix 1 (5.12). For
applications to the WFTA discussed in The Prime Factor and Winograd
Fourier Transform Algorithms: The Winograd Fourier Transform
Algorithm (Section 10.2: The Winograd Fourier Transform
Algorithm), that operation must be counted as a multiply. For lengths
longer than 7, optimal algorithms require too many additions, so
compromise structures are used.

Theorem 7 [419], [171] If G is chosen as the field of rational
numbers, the number of real multiplications necessary to calculate a
length-N DFT where N is a prime number raised to an integer
power: $N = P^m$, is given by

$$ \mu(DFT(N)) = 2N - \left( \frac{m^2 + m}{2} \right) t(P - 1) - m - 1 \tag{7.36} $$

where t(P - 1) is the number of divisors of (P - 1).

This result seems to be practically achievable only for N = 9, or
perhaps 25. In the case of N = 9, there are two rational multiplies
that must be carried out and are counted in Table 7.1 but are not
predicted by Theorem 7 (p. 75). Experience [187] indicates that
even for N = 25, an algorithm based on a Cooley-Tukey FFT using
a type 2 index map gives an over-all more balanced result.

Theorem 8 [171] If G is chosen as the field of rational numbers,
the number of real multiplications necessary to calculate a
length-N DFT where $N = 2^m$ is given by

$$ \mu(DFT(N)) = 2N - m^2 - m - 2 \tag{7.37} $$

This result is not practically useful because the number of
additions necessary to realize this minimum of multiplications becomes
very large for lengths greater than 16. Nevertheless, it proves the
minimum number of multiplications required of an optimal
algorithm is a linear function of N rather than of N log N, which is that
required of practical algorithms. The best practical power-of-two
algorithm seems to be the Split-Radix [105] FFT discussed in The
Cooley-Tukey Fast Fourier Transform Algorithm: The Split-Radix
FFT Algorithm (Section 9.2: The Split-Radix FFT Algorithm).
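
The counts stated in the corollary to Theorem 5, in Theorem 6, and in Theorem 8 are easy to tabulate. The sketch below simply evaluates those three formulas as written above; the function names are hypothetical, and the values are the complexity-theory counts for real data, which exclude multiplications by rational constants.

```python
def num_divisors(n):
    """t(n): the number of positive divisors of n."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)

def cyclic_convolution_bound(n):
    """Corollary to Theorem 5: 2N - t(N) for multiplication modulo s^N - 1."""
    return 2 * n - num_divisors(n)

def prime_dft_bound(p):
    """Theorem 6: 2P - 3 - t(P - 1) for a prime length P."""
    return 2 * p - 3 - num_divisors(p - 1)

def power_of_two_dft_bound(m):
    """Theorem 8: 2N - m^2 - m - 2 for N = 2^m."""
    n = 2 ** m
    return 2 * n - m * m - m - 2

for p in (3, 5, 7):
    print("prime length", p, "->", prime_dft_bound(p))
for m in (2, 3, 4):
    print("length", 2 ** m, "->", power_of_two_dft_bound(m))
print("length-4 cyclic convolution ->", cyclic_convolution_bound(4))
```

The value of 5 for length-4 cyclic convolution matches the five general multiplications in the worked length-5 DFT example: two for the constant residues and three for the product modulo $s^2 + 1$.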

All of these theorems use ideas based on residue reduction, mul- 
tiplication of the residues, and then combination by the CRT. It is 
remarkable that this approach finds the minimum number of re- 
quired multiplications by a constructive proof which generates an 
algorithm that achieves this minimum; and the structure of the op- 
timal algorithm is, within certain variations, unique. For shorter 
lengths, the optimal algorithms give practical programs. For longer 
lengths the uncounted operations involved with the multiplication 
of the higher degree residue polynomials become very large and 
impractical. In those cases, efficient suboptimal algorithms can be 
generated by using the same residue reduction as for the optimal 
case, but by using methods other than the Toom-Cook algorithm of 
Theorem 1 (p. 72) to multiply the residue polynomials. 

Practical long DFT algorithms are produced by combining short 
prime length optimal DFT's with the Type 1 index map from Mul- 
tidimensional Index Mapping (Chapter 3) to give the Prime Factor 
Algorithm (PFA) and the Winograd Fourier Transform Algorithm 
(WFTA) discussed in The Prime Factor and Winograd Fourier 
Transform Algorithms (Chapter 10). It is interesting to note that 
the index mapping technique is useful inside the short DFT
algorithms to replace the Toom-Cook algorithm and outside to combine
the short DFTs to calculate long DFTs.

7.3 The Automatic Generation of Winograd's Short DFTs

by Ivan Selesnick, Polytechnic Institute of New York University 

7.3.1 Introduction 

Efficient prime length DFTs are important for two reasons. First, a
particular application may require a prime length DFT. Second,
the maximum length and the variety of lengths of a PFA or WFTA
algorithm depend upon the availability of prime length modules.

This section [329], [335], [331], [333] discusses automation of the
process Winograd used for constructing prime length FFTs [29], [187]
for $N \le 7$ and that Johnson and Burrus [197] extended to $N \le 19$. It
also describes a program that will design any prime length FFT in
principle, and will also automatically generate the algorithm as a
C program and draw the corresponding flow graph.

Winograd's approach uses Rader's method to convert a prime 
length DFT into a P — 1 length cyclic convolution, polynomial 
residue reduction to decompose the problem into smaller convo- 
lutions [29], [263], and the Toom-Cook algorithm [29], [252]. The 
Chinese Remainder Theorem (CRT) for polynomials is then used 
to recombine the shorter convolutions. Unfortunately, the design 
procedure derived directly from Winograd's theory becomes cum- 
bersome for longer length DFTs, and this has often prevented the 
design of DFT programs for lengths greater than 19. 
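
The first step, Rader's conversion, can be sketched briefly. This is a minimal illustration with hypothetical function names, not the design program described in this section: it assumes an odd prime length P, finds a primitive root g by search, and evaluates the resulting length P - 1 cyclic convolution directly instead of by the Winograd reduction.

```python
import cmath

def primitive_root(p):
    """Smallest g whose powers generate all nonzero residues modulo the odd prime p."""
    for g in range(2, p):
        if len({pow(g, k, p) for k in range(p - 1)}) == p - 1:
            return g

def rader_input_order(p):
    """Indices g^(-q) mod p, q = 0, ..., p - 2, used to permute the input."""
    g_inv = pow(primitive_root(p), p - 2, p)     # inverse of g by Fermat's theorem
    return [pow(g_inv, q, p) for q in range(p - 1)]

def rader_dft(x):
    p = len(x)
    g = primitive_root(p)
    w = cmath.exp(-2j * cmath.pi / p)
    a = [x[n] for n in rader_input_order(p)]         # permuted data
    b = [w ** pow(g, j, p) for j in range(p - 1)]    # permuted W terms
    result = [0] * p
    result[0] = sum(x)                               # DC term
    for q in range(p - 1):
        # Length (P - 1) cyclic convolution of the permuted sequences.
        conv = sum(a[r] * b[(q - r) % (p - 1)] for r in range(p - 1))
        result[pow(g, q, p)] = x[0] + conv
    return result

def direct_dft(x):
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

x5 = [1, 2, 3, 4, 5]
x7 = [2, -1, 0, 3, 1, 4, -2]
err5 = max(abs(u - v) for u, v in zip(rader_dft(x5), direct_dft(x5)))
err7 = max(abs(u - v) for u, v in zip(rader_dft(x7), direct_dft(x7)))
print(rader_input_order(5), err5, err7)
```

For P = 5 the search returns g = 2, and the permuted input order is x(1), x(3), x(4), x(2), the same order that appears in (7.9).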

Here we use three methods to facilitate the construction of prime
length FFT modules. First, the matrix exchange property [29],
[197], [218] is used so that the transpose of the reduction operator
can be used rather than the more complicated CRT reconstruction
operator. This is then combined with the numerical method [197]
for obtaining the multiplication coefficients rather than the direct
use of the CRT. We also deviate from the Toom-Cook algorithm,
because it requires too many additions for the lengths in which we
are interested. Instead we use an iterated polynomial
multiplication algorithm [29]. We have incorporated these three ideas into
a single structural procedure that automates the design of prime
length FFTs.

7.3.2 Matrix Description 

It is important that each step in the Winograd FFT can be described 
using matrices. By expressing cyclic convolution as a bilinear 
form, a compact form of prime length DFTs can be obtained. 

If y is the cyclic convolution of h and x, then y can be expressed as 

$$ y = C\,[Ax \mathbin{.*} Bh] \tag{7.38} $$

where, using the Matlab convention, .* represents point by point 
multiplication. When A,B, and C are allowed to be complex, A and 
B are seen to be the DFT operator and C, the inverse DFT. When 
only real numbers are allowed, A, B, and C will be rectangular. 
This form of convolution is presented with many examples in [29]. 
Using the matrix exchange property explained in [29] and [197] 
this form can be written as 

$$ y = R B^T \left[ C^T R h \mathbin{.*} Ax \right] \tag{7.39} $$

where R is the permutation matrix that reverses order. 
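
The exchange in (7.39) can be checked on the length-4 cyclic convolution developed earlier in this chapter. In the sketch below the 5 x 4 matrix A (used for both A and B) is the product of the three reduction stages, and the 4 x 5 matrix C is an assumption worked out for this example by combining the CRT recombination with the output stage of the Toom-Cook product; the function names are hypothetical.

```python
from fractions import Fraction

# Reduction operator for N = 4: rows give r1, r2, v0, v1 and v0 + v1.
A = [[1, 1, 1, 1],
     [1, -1, 1, -1],
     [1, 0, -1, 0],
     [0, 1, 0, -1],
     [1, 1, -1, -1]]
q, hf = Fraction(1, 4), Fraction(1, 2)
# Assumed reconstruction operator (CRT plus Toom-Cook output stage).
C = [[q, q, hf, -hf, 0],
     [q, -q, -hf, -hf, hf],
     [q, q, -hf, hf, 0],
     [q, -q, hf, hf, -hf]]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def pointwise(a, b):
    return [p * s for p, s in zip(a, b)]

def convolve_bilinear(x, h):
    """y = C [Ax .* Bh] with B = A, as in (7.38)."""
    return matvec(C, pointwise(matvec(A, x), matvec(A, h)))

def convolve_exchanged(x, h):
    """y = R B^T [C^T R h .* A x] with B = A, as in (7.39)."""
    d = matvec(transpose(C), h[::-1])        # C^T R h: precomputable for fixed h
    return matvec(transpose(A), pointwise(d, matvec(A, x)))[::-1]

x = [1, 2, 3, 4]
h = [5, 6, 7, 8]
y_bilinear = convolve_bilinear(x, h)
y_exchanged = convolve_exchanged(x, h)
print(y_bilinear, y_exchanged)
```

In the exchanged form the rational entries of C are applied only to the fixed vector h, so the arithmetic done on the data x involves A and its transpose alone.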

When $h$ is fixed, as it is when considering prime length DFTs, the
term $C^T R h$ can be precomputed and a diagonal matrix $D$ formed
by $D = \mathrm{diag}\{C^T R h\}$. This is advantageous because, in general, $C$
is more complicated than $B$, so the ability to "hide" $C$ saves
computation. Now $y = R B^T D A x$, or $y = R A^T D A x$ since $A$ and $B$ can
be the same; they implement a polynomial reduction. This form
can also be used for the prime length DFTs; it is only
necessary to permute the entries of $x$ and to ensure that the DC
term is computed correctly. The computation of the DC term is
simple, for the residue of a polynomial modulo $s - 1$ is always the
sum of the coefficients. After adding the $x_0$ term of the original
input sequence to the $s - 1$ residue, the DC term is obtained. Now
$\mathrm{DFT}\{x\} = R A^T D A x$. In [197] Johnson observes that by permuting
the elements on the diagonal of $D$, the output can be permuted,
so that the $R$ matrix can be hidden in $D$, and $\mathrm{DFT}\{x\} = A^T D A x$.
From the knowledge of this form, once $A$ is found, $D$ can be found
numerically [197].
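The bilinear form (7.38) can be checked numerically in its simplest complex case, where $A$ and $B$ are the DFT and $C$ is the inverse DFT. The following is a minimal sketch in Python (not the Matlab used by the authors); the function names and the test vectors are illustrative assumptions.

```python
import cmath

def dft_matrix(n, sign=-1):
    """n x n matrix with entries exp(sign * 2*pi*j * k*m / n)."""
    return [[cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n)]
            for k in range(n)]

def matvec(mat, vec):
    """Plain matrix-vector product."""
    return [sum(a * v for a, v in zip(row, vec)) for row in mat]

def cyclic_convolution_bilinear(h, x):
    """y = C [A x .* B h] with A = B = DFT and C = inverse DFT."""
    n = len(x)
    A = B = dft_matrix(n)
    C = [[entry / n for entry in row] for row in dft_matrix(n, sign=+1)]
    ax, bh = matvec(A, x), matvec(B, h)
    pointwise = [u * v for u, v in zip(ax, bh)]   # the ".*" step
    return matvec(C, pointwise)

def cyclic_convolution_direct(h, x):
    """Reference: y(k) = sum_m h(m) x((k - m) mod n)."""
    n = len(x)
    return [sum(h[m] * x[(k - m) % n] for m in range(n)) for k in range(n)]

# Illustrative data (not from the text).
h = [1.0, 2.0, 0.0, -1.0, 3.0]
x = [4.0, -2.0, 1.0, 0.5, 2.0]
y_bilinear = cyclic_convolution_bilinear(h, x)
y_direct = cyclic_convolution_direct(h, x)
```

Both results agree up to rounding. The real-valued rectangular $A$, $B$, $C$ discussed in the text are not constructed here.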

7.3.3 Programming the Design Procedure 

Because each of the above steps can be described by matrices, the
development of prime length FFTs is made convenient by the
use of a matrix-oriented programming language such as Matlab.
After specifying the appropriate matrices that describe the desired
FFT algorithm, generating code involves compiling the matrices
into the desired code for execution.

Each matrix is a section of one stage of the flow graph that corre- 
sponds to the DFT program. The four stages are: 

1. Permutation Stage: Permutes input and output sequence. 

2. Reduction Stage: Reduces the cyclic convolution to smaller 
polynomial products. 

3. Polynomial Product Stage: Performs the polynomial multi- 
plications. 

4. Multiplication Stage: Implements the point-by-point
multiplication in the bilinear form.

Each of the stages can be clearly seen in the flow graphs for the
DFTs. Figure 7.3 shows the flow graph for a length-17 DFT
algorithm that was automatically drawn by the program.




Figure 7.3: Flowgraph of length-17 DFT



The programs that accomplish this process are written in Matlab
and C. Those that compute the appropriate matrices are written in
Matlab. These matrices are then stored as two ASCII files, with
the dimensions in one and the matrix elements in the second. A
C program then reads the files and compiles them to produce the
final FFT program in C [335].
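To illustrate what "compiling a matrix into code" can mean, the sketch below turns a matrix with entries 0 and ±1 into straight-line C-like assignments, one per row. It is a hypothetical toy written in Python, not the authors' Matlab/C tool chain; the function name `compile_matrix` and the variable names `x` and `y` are assumptions made for the example.

```python
def compile_matrix(matrix, in_name="x", out_name="y"):
    """Emit one C-like assignment per row of a matrix with entries 0, +1, -1."""
    lines = []
    for r, row in enumerate(matrix):
        expr = ""
        for c, entry in enumerate(row):
            if entry == 0:
                continue                      # zero entries cost nothing
            if entry not in (1, -1):
                raise ValueError("this sketch only handles 0 and +/-1 entries")
            if entry == 1:
                sign = " + " if expr else ""
            else:
                sign = " - " if expr else "-"
            expr += f"{sign}{in_name}[{c}]"
        lines.append(f"{out_name}[{r}] = {expr or '0'};")
    return lines

# A length-2 butterfly as the input matrix.
butterfly = [[1, 1], [1, -1]]
code = compile_matrix(butterfly)
```

For the butterfly, `code` holds the two statements `y[0] = x[0] + x[1];` and `y[1] = x[0] - x[1];`. Each matrix stage of the flow graph could be emitted this way, one after another.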






7.3.4 The Reduction Stage 

The reduction of an $N$th degree polynomial, $X(s)$, modulo the
cyclotomic polynomial factors of $(s^N - 1)$ requires only additions
for many $N$; however, the actual number of additions depends
upon the way in which the reduction proceeds. The reduction is
most efficiently performed in steps. For example, suppose $N = 4$ and
$((X(s)))_{s-1}$, $((X(s)))_{s+1}$, and $((X(s)))_{s^2+1}$ are required, where the
double parentheses denote polynomial reduction modulo $(s - 1)$,
$s + 1$, and $s^2 + 1$. Then in the first step $((X(s)))_{s^2-1}$ and
$((X(s)))_{s^2+1}$ should be computed. In the second step, $((X(s)))_{s-1}$
and $((X(s)))_{s+1}$ can be found by reducing $((X(s)))_{s^2-1}$. This
process is described by the diagram in Figure 7.4.




Figure 7.4: Factorization of $s^4 - 1$ in steps



When $N$ is even, the appropriate first factorization is
$(s^{N/2} - 1)(s^{N/2} + 1)$; however, the next appropriate
factorization is frequently less obvious. The following procedure has
been found to generate a factorization in steps that coincides
with the factorization that minimizes the cumulative number of
additions incurred by the steps. The prime factors of $N$ are the
basis of this procedure, and their importance is clear from the
useful well-known equation $s^N - 1 = \prod_{n \mid N} C_n(s)$, where $C_n(s)$ is
the $n$th cyclotomic polynomial.

We first introduce the following two functions defined on the
positive integers,

$$\psi(N) = \text{the smallest prime factor of } N \text{ for } N > 1 \qquad (7.40)$$

and $\psi(1) = 1$. The second function, $\nu(k, t)$, is used in (7.43); the
worked example below is consistent with reading $\nu(k, t)$ as the
largest power of $t$ that divides $k$.

Suppose $P(s)$ is equal to either $(s^N - 1)$ or an intermediate
noncyclotomic polynomial appearing in the factorization process, for
example, $(s^2 - 1)$, above. Write $P(s)$ in terms of its cyclotomic
factors,

$$P(s) = C_{k_1}(s)\, C_{k_2}(s) \cdots C_{k_L}(s) \qquad (7.41)$$

define the two sets, $G$ and $G'$, by

$$G = \{k_1, \cdots, k_L\} \quad \text{and} \quad G' = \{k / \gcd(G) : k \in G\} \qquad (7.42)$$

and define the two integers, $t$ and $T$, by

$$t = \min\{\psi(k) : k \in G', k > 1\} \quad \text{and} \quad T = \max\{\nu(k, t) : k \in G'\} \qquad (7.43)$$

Then form two new sets,

$$A = \{k \in G : T \nmid k\} \quad \text{and} \quad B = \{k \in G : T \mid k\} \qquad (7.44)$$

The factorization of $P(s)$,

$$P(s) = \Big(\prod_{k \in A} C_k(s)\Big) \Big(\prod_{k \in B} C_k(s)\Big) \qquad (7.45)$$

has been found useful in the procedure for factoring $(s^N - 1)$. This
is best illustrated with an example.

Example: $N = 36$

Step 1. Let $P(s) = s^{36} - 1$. Since $P = C_1 C_2 C_3 C_4 C_6 C_9 C_{12} C_{18} C_{36}$,

$$G = G' = \{1, 2, 3, 4, 6, 9, 12, 18, 36\} \qquad (7.46)$$

$$t = \min\{2, 3\} = 2 \qquad (7.47)$$

$$A = \{k \in G : 4 \nmid k\} = \{1, 2, 3, 6, 9, 18\} \qquad (7.48)$$

$$B = \{k \in G : 4 \mid k\} = \{4, 12, 36\} \qquad (7.49)$$

Hence the factorization of $s^{36} - 1$ into two intermediate
polynomials is as expected,

$$\prod_{k \in A} C_k(s) = s^{18} - 1, \qquad \prod_{k \in B} C_k(s) = s^{18} + 1 \qquad (7.50)$$

If a 36th degree polynomial, $X(s)$, is represented by a vector of
coefficients, $X = (x_{35}, \cdots, x_0)$, then $((X(s)))_{s^{18}-1}$ (represented by
$X'$) and $((X(s)))_{s^{18}+1}$ (represented by $X''$) are given by

$$\begin{bmatrix} X' \\ X'' \end{bmatrix} = \begin{bmatrix} I_{18} & I_{18} \\ -I_{18} & I_{18} \end{bmatrix} X \qquad (7.51)$$

which entails 36 additions.

Step 2. This procedure is repeated with $P(s) = s^{18} - 1$ and
$P(s) = s^{18} + 1$. We will just show it for the latter. Let
$P(s) = s^{18} + 1$. Since $P = C_4 C_{12} C_{36}$,

$$G = \{4, 12, 36\}, \quad G' = \{1, 3, 9\} \qquad (7.52)$$

$$t = \min\{3\} = 3 \qquad (7.53)$$

$$T = \max\{\nu(k, 3) : k \in G'\} = \max\{1, 3, 9\} = 9 \qquad (7.54)$$

$$A = \{k \in G : 9 \nmid k\} = \{4, 12\} \qquad (7.55)$$

$$B = \{k \in G : 9 \mid k\} = \{36\} \qquad (7.56)$$

This yields the two intermediate polynomials,

$$s^6 + 1 \quad \text{and} \quad s^{12} - s^6 + 1 \qquad (7.57)$$

In the notation used above,

$$\begin{bmatrix} X' \\ X'' \end{bmatrix} = \begin{bmatrix} I_6 & -I_6 & I_6 \\ I_6 & I_6 & \\ -I_6 & & I_6 \end{bmatrix} X \qquad (7.58)$$

entailing 24 additions. Continuing this process results in a
factorization in steps.
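The splitting rule (7.42)-(7.44) is mechanical enough to be written down directly. The sketch below is a minimal Python rendering under two assumptions: a polynomial is represented only by the set $G$ of indices of its cyclotomic factors, and $\nu(k, t)$ is taken to be the largest power of $t$ dividing $k$ (the reading suggested by the $N = 36$ example). The function names are hypothetical, and the code has only been exercised on that example.

```python
from functools import reduce
from math import gcd

def psi(n):
    """Smallest prime factor of n, with psi(1) = 1, as in (7.40)."""
    if n == 1:
        return 1
    p = 2
    while n % p:
        p += 1
    return p

def nu(k, t):
    """Largest power of t that divides k (assumed meaning of nu)."""
    power = 1
    while k % (power * t) == 0:
        power *= t
    return power

def split(G):
    """Split the index set G into (A, B) following (7.42)-(7.44)."""
    G = sorted(G)
    if len(G) < 2:
        raise ValueError("a single cyclotomic factor cannot be split")
    g = reduce(gcd, G)
    G_prime = [k // g for k in G]
    t = min(psi(k) for k in G_prime if k > 1)
    T = max(nu(k, t) for k in G_prime)
    A = [k for k in G if k % T != 0]     # T does not divide k
    B = [k for k in G if k % T == 0]     # T divides k
    return A, B

divisors_36 = [k for k in range(1, 37) if 36 % k == 0]
step1 = split(divisors_36)     # factors of s^36 - 1
step2 = split(step1[1])        # factors of s^18 + 1
```

Here `step1` reproduces (7.48)-(7.49) and `step2` reproduces (7.55)-(7.56). Calling `split` repeatedly on each part until only single indices remain yields the full factorization in steps.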

In order to see the number of additions this scheme uses for
numbers of the form $N = P - 1$ (which is relevant to prime length FFT
algorithms), Figure 4 shows the number of additions the reduction
process uses when the polynomial $X(s)$ is real.

Figure 4: Number of Additions for Reduction Stage



7.3.5 The Polynomial Product Stage 

The iterated convolution algorithm can be used to construct an $N$
point linear convolution algorithm from shorter linear convolution
algorithms [29]. Suppose the linear convolution $y$ of the $n$ point
vectors $x$ and $h$ ($h$ known) is described by

$$y = E_n^T D E_n x \qquad (7.59)$$

where $E_n$ is an "expansion" matrix, the elements of which are $\pm 1$'s
and 0's, and $D$ is an appropriate diagonal matrix. Because the only
multiplications in this expression are by the elements of $D$, the
number of multiplications required, $M(n)$, is equal to the number
of rows of $E_n$. The number of additions is denoted by $A(n)$.

Given a matrix $E_{n_1}$ and a matrix $E_{n_2}$, the iterated algorithm gives a
method for combining $E_{n_1}$ and $E_{n_2}$ to construct a valid expansion
matrix, $E_N$, for $N \le n_1 n_2$. Specifically,

$$E_{n_1, n_2} = (I_{M(n_2)} \otimes E_{n_1}) (E_{n_2} \otimes I_{n_1}) \qquad (7.60)$$

The product $n_1 n_2$ may be greater than $N$, for zeros can be
(conceptually) appended to $x$. The operation count associated with
$E_{n_1, n_2}$ is

$$A(n_1, n_2) = n_1 A(n_2) + A(n_1) M(n_2) \qquad (7.61)$$

$$M(n_1, n_2) = M(n_1) M(n_2) \qquad (7.62)$$

Although they are both valid expansion matrices, $E_{n_1,n_2} \ne E_{n_2,n_1}$
and $A(n_1,n_2) \ne A(n_2,n_1)$. Because $M(n_1,n_2) = M(n_2,n_1)$, it is
desirable to choose an ordering of factors to minimize the additions
incurred by the expansion matrix. The following [7], [263] follows
from the above.

7.3.5.1 Multiple Factors 

Note that a valid expansion matrix, $E_N$, can be constructed from
$E_{n_1,n_2}$ and $E_{n_3}$, for $N \le n_1 n_2 n_3$. In general, any number of factors
can be used to create larger expansion matrices. The operation
count associated with $E_{n_1,n_2,n_3}$ is

$$A(n_1,n_2,n_3) = n_1 n_2 A(n_3) + n_1 A(n_2) M(n_3) + A(n_1) M(n_2) M(n_3) \qquad (7.63)$$

$$M(n_1,n_2,n_3) = M(n_1) M(n_2) M(n_3) \qquad (7.64)$$

These equations generalize in the predicted way when more factors
are considered. Because the ordering of the factors is relevant
in the equation for $A(\cdot)$ but not for $M(\cdot)$, it is again desirable to
order the factors to minimize the number of additions. By exploiting
the following property of the expressions for $A(\cdot)$ and $M(\cdot)$, the
optimal ordering can be found [7].

Preservation of Optimal Ordering. Suppose $A(n_1,n_2,n_3) \le
\min\{A(n_{k_1}, n_{k_2}, n_{k_3}) : k_1, k_2, k_3 \in \{1,2,3\} \text{ and distinct}\}$, then

1. $A(n_1,n_2) \le A(n_2,n_1)$ (7.65)

2. $A(n_2,n_3) \le A(n_3,n_2)$ (7.66)

3. $A(n_1,n_3) \le A(n_3,n_1)$ (7.67)

The generalization of this property to more than two factors reveals
that an optimal ordering of $\{n_1, \cdots, n_{L-i}\}$ is preserved in an
optimal ordering of $\{n_1, \cdots, n_L\}$. Therefore, if $(n_1, \cdots, n_L)$ is an optimal
ordering of $\{n_1, \cdots, n_L\}$, then $(n_k, n_{k+1})$ is an optimal ordering of
$\{n_k, n_{k+1}\}$ and consequently

$$\frac{A(n_k)}{M(n_k) - n_k} \le \frac{A(n_{k+1})}{M(n_{k+1}) - n_{k+1}} \qquad (7.68)$$

for all $k = 1, 2, \cdots, L-1$.

This immediately suggests that an optimal ordering of $\{n_1, \cdots, n_L\}$
is one for which

$$\frac{A(n_1)}{M(n_1) - n_1}, \; \cdots, \; \frac{A(n_L)}{M(n_L) - n_L}$$

is nondecreasing. Hence, ordering the factors, $\{n_1, \cdots, n_L\}$, to
minimize the number of additions incurred by $E_{n_1, \cdots, n_L}$ simply involves
computing the appropriate ratios.
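The ratio rule can be tried out on a small case. The sketch below evaluates the addition count for any ordering using the pattern of (7.61) and (7.63), sorts the factors by $A(n)/(M(n) - n)$, and compares the result with a brute-force search over all orderings. The per-factor counts in `A_counts` and `M_counts` are made-up illustrative numbers, not values from the text, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def additions(order, A, M):
    """A(n_1,...,n_L): the i-th term is (n_1...n_{i-1}) * A(n_i) * (M(n_{i+1})...M(n_L))."""
    total = 0
    for i, n_i in enumerate(order):
        left = 1
        for n in order[:i]:
            left *= n
        right = 1
        for n in order[i + 1:]:
            right *= M[n]
        total += left * A[n_i] * right
    return total

def optimal_order(factors, A, M):
    """Order factors so that A(n) / (M(n) - n) is nondecreasing."""
    return sorted(factors, key=lambda n: Fraction(A[n], M[n] - n))

# Hypothetical per-factor operation counts (for illustration only).
A_counts = {2: 3, 3: 20, 5: 44}
M_counts = {2: 3, 3: 5, 5: 9}

best = optimal_order([5, 2, 3], A_counts, M_counts)
best_cost = additions(best, A_counts, M_counts)
brute_force_cost = min(additions(list(p), A_counts, M_counts)
                       for p in permutations([2, 3, 5]))
```

Exact rational arithmetic (`Fraction`) is used for the ratios so that ties are not decided by rounding. For these numbers the sorted order is (2, 3, 5) and its cost matches the brute-force minimum.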
7.3.6 Discussion and Conclusion



We have designed prime length FFTs up to length 53 that are as
good as the previous designs that only went up to 19. Table 7.2 gives
the operation counts for the new and previously designed modules,
assuming complex inputs.

It is interesting to note that the operation counts depend on the
factorability of $P - 1$. The primes 11, 23, and 47 are all of the form
$1 + 2P_1$, making the design of efficient FFTs for these lengths more
difficult.

Further deviations from the original Winograd approach than we
have made could prove useful for longer lengths. We investigated,
for example, the use of twiddle factors at appropriate points in
the decomposition stage; these can sometimes be used to divide
the cyclic convolution into smaller convolutions. Their use means,
however, that the 'center' multiplications would no longer be by
purely real or imaginary numbers.

N      Mult    Adds
7      16      72
11     40      168
13     40      188
17     82      274
19     88      360
23     174     672
29     190     766
31     160     984
37     220     920
41     282     1140
43     304     1416
47     640     2088
53     556     2038

Table 7.2: Operation counts for prime length DFTs



The approach of writing a program that writes another program is a
valuable one for several reasons. Programming the design process
for prime length FFTs has the advantages of being
practical, error-free, and flexible. The flexibility is important
because it allows for modification and experimentation with different
algorithmic ideas. Above all, it has allowed longer DFTs to be
reliably designed.

More details on the generation of programs for prime length FFTs 
can be found in the 1993 Technical Report. 



Chapter 8 

DFT and FFT: An Algebraic 
View 

by Markus Pueschel, Carnegie Mellon University 

In infinite, or non-periodic, discrete-time signal processing, there is
a strong connection between the z-transform, Laurent series,
convolution, and the discrete-time Fourier transform (DTFT) [277].
As one may expect, a similar connection exists for the DFT but
bears surprises. Namely, it turns out that the proper framework
for the DFT requires modulo operations of polynomials, which
means working with so-called polynomial algebras [138].
Associated with polynomial algebras is the Chinese remainder
theorem, which describes the DFT algebraically and can be used as
a tool to concisely derive various FFTs as well as convolution
algorithms [268], [409], [414], [12] (see also Winograd's Short DFT
Algorithms (Chapter 7)). The polynomial algebra framework was
fully developed for signal processing as part of the algebraic
signal processing theory (ASP). ASP identifies the structure
underlying many transforms used in signal processing, provides deep
insight into their properties, and enables the derivation of their fast
algorithms [295], [293], [291], [294]. Here we focus on the
algebraic description of the DFT and on the algebraic derivation of the
general-radix Cooley-Tukey FFT from Factoring the Signal
Processing Operators (Chapter 6). The derivation will make use of
and extend the Polynomial Description of Signals (Chapter 4). We
start with motivating the appearance of modulo operations.

1 This content is available online at <http://cnx.org/content/m16331/1.14/>.

The z-transform associates with infinite discrete signals
$X = (\cdots, x(-1), x(0), x(1), \cdots)$ a Laurent series:

$$X \mapsto X(s) = \sum_{n \in \mathbb{Z}} x(n) s^n. \qquad (8.1)$$

Here we used $s = z^{-1}$ to simplify the notation in the following.
The DTFT of $X$ is the evaluation of $X(s)$ on the unit circle

$$X(e^{-j\omega}), \quad -\pi < \omega \le \pi. \qquad (8.2)$$

Finally, filtering or (linear) convolution is simply the
multiplication of Laurent series,

$$H * X \leftrightarrow H(s) X(s). \qquad (8.3)$$

For finite signals $X = (x(0), \cdots, x(N-1))$ one expects that the
equivalent of (8.1) becomes a mapping to polynomials of degree
$N - 1$,

$$X \mapsto X(s) = \sum_{n=0}^{N-1} x(n) s^n, \qquad (8.4)$$

and that the DFT is an evaluation of these polynomials. Indeed,
the definition of the DFT in Winograd's Short DFT Algorithms
(Chapter 7) shows that

$$C(k) = X(W_N^k) = X(e^{-j 2\pi k / N}), \quad 0 \le k < N, \qquad (8.5)$$

i.e., the DFT computes the evaluations of the polynomial $X(s)$ at
the $N$th roots of unity.

The problem arises with the equivalent of (8.3), since the
multiplication $H(s) X(s)$ of two polynomials of degree $N - 1$ yields one of
degree $2N - 2$. Also, it does not coincide with the circular
convolution known to be associated with the DFT. The solution to both
problems is to reduce the product modulo $s^N - 1$:

$$H \ast_{\mathrm{circ}} X \leftrightarrow H(s) X(s) \bmod (s^N - 1). \qquad (8.6)$$



Concept             Infinite Time                                   Finite Time

Signal              $X(s) = \sum_{n \in \mathbb{Z}} x(n) s^n$       $\sum_{n=0}^{N-1} x(n) s^n$

Filter              $H(s) = \sum_{n \in \mathbb{Z}} h(n) s^n$       $\sum_{n=0}^{N-1} h(n) s^n$

Convolution         $H(s) X(s)$                                     $H(s) X(s) \bmod (s^N - 1)$

Fourier transform   DTFT: $X(e^{-j\omega})$, $-\pi < \omega \le \pi$    DFT: $X(e^{-j 2\pi k / N})$, $0 \le k < N$

Table 8.1: Infinite and finite discrete time signal processing.



The resulting polynomial then has again degree $N - 1$ and this form
of convolution becomes equivalent to circular convolution of the
polynomial coefficients. We also observe that the evaluation points
in (8.5) are precisely the roots of $s^N - 1$. This connection will
become clear in this chapter.
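Both observations can be checked on a small example: the DFT as evaluation of $X(s)$ at the $N$th roots of unity, as in (8.5), and circular convolution as the product $H(s)X(s)$ reduced modulo $s^N - 1$, as in (8.6). The following Python sketch uses hypothetical function names and illustrative data.

```python
import cmath

def poly_eval(coeffs, point):
    """Horner evaluation of sum_n coeffs[n] * point**n."""
    result = 0
    for c in reversed(coeffs):
        result = result * point + c
    return result

def dft_by_evaluation(x):
    """C(k) = X(W_N^k) with W_N = exp(-2*pi*j/N)."""
    n = len(x)
    return [poly_eval(x, cmath.exp(-2j * cmath.pi * k / n)) for k in range(n)]

def multiply_mod_sN_minus_1(h, x):
    """Coefficients of H(s) X(s) mod (s^N - 1): exponents wrap around modulo N."""
    n = len(x)
    y = [0] * n
    for i, hi in enumerate(h):
        for j, xj in enumerate(x):
            y[(i + j) % n] += hi * xj
    return y

# Illustrative data (not from the text).
h = [1, 2, 0, -1]
x = [3, 0, 1, 2]
y = multiply_mod_sN_minus_1(h, x)          # [7, 5, -1, 1]

# Evaluating at a root of s^N - 1 turns the reduced product into a plain product.
lhs = dft_by_evaluation(y)
rhs = [a * b for a, b in zip(dft_by_evaluation(h), dft_by_evaluation(x))]
```

Because every evaluation point is a root of $s^N - 1$, reducing modulo $s^N - 1$ does not change the evaluated values, which is why `lhs` and `rhs` agree up to rounding.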

The discussion is summarized in Table 8.1. 

The proper framework to describe the multiplication of polynomials
modulo a fixed polynomial is that of polynomial algebras. Together
with the Chinese remainder theorem, they provide the theoretical
underpinning for the DFT and the Cooley-Tukey FFT.

In this chapter, the DFT will naturally arise as a linear mapping
with respect to chosen bases, i.e., as a matrix. Indeed, the
definition shows that if all inputs and outputs are collected into
vectors $X = (x(0), \cdots, x(N-1))$ and $C = (C(0), \cdots, C(N-1))$, then
the definition in Winograd's Short DFT Algorithms (Chapter 7) is
equivalent to

$$C = \mathrm{DFT}_N X, \qquad (8.7)$$

where

$$\mathrm{DFT}_N = \left[ W_N^{kn} \right]_{0 \le k, n < N}. \qquad (8.8)$$

The matrix point of view is adopted in the FFT books [388], [381].



8.1 Polynomial Algebras and the DFT 

In this section we introduce polynomial algebras and explain how
they are associated with transforms. Then we identify this
connection for the DFT. Later we use polynomial algebras to derive the
Cooley-Tukey FFT.

For further background on the mathematics in this section and 
polynomial algebras in particular, we refer to [138]. 

8.1.1 Polynomial Algebra 

An algebra $\mathcal{A}$ is a vector space that also provides a multiplication
of its elements such that the distributivity law holds (see [138] for
a complete definition). Examples include the sets of complex or
real numbers $\mathbb{C}$ or $\mathbb{R}$, and the sets of complex or real polynomials
in the variable $s$: $\mathbb{C}[s]$ or $\mathbb{R}[s]$.

The key player in this chapter is the polynomial algebra. Given a
fixed polynomial $P(s)$ of degree $\deg(P) = N$, we define a
polynomial algebra as the set

$$\mathbb{C}[s]/P(s) = \{X(s) \mid \deg(X) < \deg(P)\} \qquad (8.9)$$

of polynomials of degree smaller than $N$ with addition and
multiplication modulo $P$. Viewed as a vector space, $\mathbb{C}[s]/P(s)$ hence
has dimension $N$.

Every polynomial $X(s) \in \mathbb{C}[s]$ is reduced to a unique polynomial
$R(s)$ modulo $P(s)$ of degree smaller than $N$. $R(s)$ is computed
using division with rest, namely

$$X(s) = Q(s) P(s) + R(s), \quad \deg(R) < \deg(P). \qquad (8.10)$$

Regarding this equation modulo $P$, $P(s)$ becomes zero, and we get

$$X(s) \equiv R(s) \bmod P(s). \qquad (8.11)$$

We read this equation as "$X(s)$ is congruent (or equal) to $R(s)$
modulo $P(s)$." We will also write $X(s) \bmod P(s)$ to denote that $X(s)$
is reduced modulo $P(s)$. Obviously,

$$P(s) \equiv 0 \bmod P(s). \qquad (8.12)$$

As a simple example we consider $\mathcal{A} = \mathbb{C}[s]/(s^2 - 1)$, which has
dimension 2. A possible basis is $b = (1, s)$. In $\mathcal{A}$, for example,
$s \cdot (s + 1) = s^2 + s \equiv s + 1 \bmod (s^2 - 1)$, obtained through division
with rest

$$s^2 + s = 1 \cdot (s^2 - 1) + (s + 1) \qquad (8.13)$$

or simply by replacing $s^2$ with 1 (since $s^2 - 1 = 0$ implies $s^2 = 1$).
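Division with rest, as in (8.10), is easy to carry out on coefficient lists. The sketch below is a minimal Python version that assumes a monic modulus $P(s)$ and stores coefficients lowest degree first; the function names are hypothetical. It reproduces the example (8.13).

```python
def poly_mul(a, b):
    """Coefficients of A(s) * B(s); lists hold the lowest degree first."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def poly_mod(x, p):
    """Remainder R(s) of X(s) divided by a monic P(s), padded to deg(P) entries."""
    if p[-1] != 1:
        raise ValueError("this sketch assumes a monic modulus")
    r = list(x)
    deg_p = len(p) - 1
    # Cancel the leading coefficient repeatedly, highest degree first.
    for top in range(len(r) - 1, deg_p - 1, -1):
        q = r[top]                       # next coefficient of the quotient Q(s)
        for i, pi in enumerate(p):
            r[top - deg_p + i] -= q * pi
    rem = r[:deg_p]
    return rem + [0] * (deg_p - len(rem))

# s * (s + 1) mod (s^2 - 1): expect s + 1, i.e. [1, 1].
product = poly_mul([0, 1], [1, 1])          # s^2 + s
remainder = poly_mod(product, [-1, 0, 1])   # modulus s^2 - 1
```

Multiplication in $\mathbb{C}[s]/P(s)$ is then `poly_mod(poly_mul(a, b), p)`.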

8.1.2 Chinese Remainder Theorem (CRT) 

Assume $P(s) = Q(s) R(s)$ factors into two coprime (no common
factors) polynomials $Q$ and $R$. Then the Chinese remainder
theorem (CRT) for polynomials is the linear mapping (see footnote 2)

$$\Delta : \mathbb{C}[s]/P(s) \to \mathbb{C}[s]/Q(s) \oplus \mathbb{C}[s]/R(s), \quad
X(s) \mapsto (X(s) \bmod Q(s), \; X(s) \bmod R(s)). \qquad (8.14)$$

Here, $\oplus$ is the Cartesian product of vector spaces with elementwise
operation (also called outer direct sum). In words, the CRT asserts
that computing (addition, multiplication, scalar multiplication) in
$\mathbb{C}[s]/P(s)$ is equivalent to computing in parallel in $\mathbb{C}[s]/Q(s)$ and
$\mathbb{C}[s]/R(s)$.

If we choose bases $b$, $c$, $d$ in the three polynomial algebras, then $\Delta$
can be expressed as a matrix. As usual with linear mappings, this
matrix is obtained by mapping every element of $b$ with $\Delta$,
expressing it in the concatenation $c \cup d$ of the bases $c$ and $d$, and writing
the results into the columns of the matrix.

As an example, we consider again the polynomial
$P(s) = s^2 - 1 = (s - 1)(s + 1)$ and the CRT decomposition

$$\Delta : \mathbb{C}[s]/(s^2 - 1) \to \mathbb{C}[s]/(s - 1) \oplus \mathbb{C}[s]/(s + 1). \qquad (8.15)$$

As bases, we choose $b = (1, s)$, $c = (1)$, $d = (1)$. $\Delta(1) = (1, 1)$
with the same coordinate vector in $c \cup d = (1, 1)$. Further, because
of $s \equiv 1 \bmod (s - 1)$ and $s \equiv -1 \bmod (s + 1)$, $\Delta(s) = (s, s) \equiv
(1, -1)$ with the same coordinate vector. Thus, $\Delta$ in matrix form is
the so-called butterfly matrix, which is a DFT of size 2:

$$\mathrm{DFT}_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}.$$
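The recipe "map every basis element, write the coordinates into the columns" can be sketched in a few lines of Python. The code below assumes monic factors $Q$ and $R$, monomial bases $(1, s, s^2, \dots)$ in all three algebras, and coefficient lists stored lowest degree first; the function names are hypothetical.

```python
def poly_mod(x, p):
    """Remainder of X(s) modulo a monic P(s), padded to deg(P) coefficients."""
    r = list(x)
    deg_p = len(p) - 1
    for top in range(len(r) - 1, deg_p - 1, -1):
        q = r[top]
        for i, pi in enumerate(p):
            r[top - deg_p + i] -= q * pi
    rem = r[:deg_p]
    return rem + [0] * (deg_p - len(rem))

def crt_matrix(q, r):
    """Matrix of X -> (X mod Q, X mod R) with respect to monomial bases."""
    n = (len(q) - 1) + (len(r) - 1)          # degree of P = Q * R
    columns = []
    for k in range(n):
        basis_element = [0] * k + [1]        # the monomial s^k
        columns.append(poly_mod(basis_element, q) + poly_mod(basis_element, r))
    # The images are the columns; transpose them into a list of rows.
    return [[col[row] for col in columns] for row in range(n)]

# s^2 - 1 = (s - 1)(s + 1): the butterfly matrix.
butterfly = crt_matrix([-1, 1], [1, 1])

# s^4 - 1 = (s^2 - 1)(s^2 + 1): a butterfly applied blockwise.
blockwise = crt_matrix([-1, 0, 1], [1, 0, 1])
```

The first call returns the matrix $\mathrm{DFT}_2$ derived above. The second returns a 4 x 4 matrix with the same butterfly pattern acting on blocks of size 2.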



2 More precisely, isomorphism of algebras or isomorphism of $\mathcal{A}$-modules.

8.1.3 Polynomial Transforms

Assume $P(s) \in \mathbb{C}[s]$ has pairwise distinct zeros
$\alpha = (\alpha_0, \cdots, \alpha_{N-1})$. Then the CRT can be used to completely
decompose $\mathbb{C}[s]/P(s)$ into its spectrum:

$$\Delta : \mathbb{C}[s]/P(s) \to \mathbb{C}[s]/(s - \alpha_0) \oplus \cdots \oplus \mathbb{C}[s]/(s - \alpha_{N-1}),$$
$$X(s) \mapsto (X(s) \bmod (s - \alpha_0), \cdots, X(s) \bmod (s - \alpha_{N-1}))
= (X(\alpha_0), \cdots, X(\alpha_{N-1})). \qquad (8.16)$$

If we choose a basis $b = (P_0(s), \cdots, P_{N-1}(s))$ in $\mathbb{C}[s]/P(s)$ and
bases $b_i = (1)$ in each $\mathbb{C}[s]/(s - \alpha_i)$, then $\Delta$, as a linear mapping,
is represented by a matrix. The matrix is obtained by mapping
every basis element $P_n$, $0 \le n < N$, and collecting the results in the
columns of the matrix. The result is

$$\mathcal{P}_{b,\alpha} = [P_n(\alpha_k)]_{0 \le k, n < N} \qquad (8.17)$$

and is called the polynomial transform for $\mathcal{A} = \mathbb{C}[s]/P(s)$ with
basis $b$.

If, in general, we choose $b_i = (\beta_i)$ as spectral basis, then the matrix
corresponding to the decomposition (8.16) is the scaled
polynomial transform

$$\mathrm{diag}_{0 \le n < N}(1/\beta_n) \, \mathcal{P}_{b,\alpha}, \qquad (8.18)$$

where $\mathrm{diag}_{0 \le n < N}(\gamma_n)$ denotes a diagonal matrix with diagonal
entries $\gamma_n$.

We jointly refer to polynomial transforms, scaled or not, as Fourier
transforms.

8.1.4 DFT as a Polynomial Transform

We show that the $\mathrm{DFT}_N$ is a polynomial transform for
$\mathcal{A} = \mathbb{C}[s]/(s^N - 1)$ with basis $b = (1, s, \cdots, s^{N-1})$. Namely,

$$s^N - 1 = \prod_{0 \le k < N} (s - W_N^k), \qquad (8.19)$$

which means that $\Delta$ takes the form

$$\Delta : \mathbb{C}[s]/(s^N - 1) \to \mathbb{C}[s]/(s - W_N^0) \oplus \cdots \oplus \mathbb{C}[s]/(s - W_N^{N-1}),$$
$$X(s) \mapsto (X(s) \bmod (s - W_N^0), \cdots, X(s) \bmod (s - W_N^{N-1}))
= (X(W_N^0), \cdots, X(W_N^{N-1})). \qquad (8.20)$$

The associated polynomial transform hence becomes

$$\mathcal{P}_{b,\alpha} = \left[ W_N^{kn} \right]_{0 \le k, n < N} = \mathrm{DFT}_N. \qquad (8.21)$$

This interpretation of the DFT has been known at least since [409],
[268] and clarifies the connection between the evaluation points in
(8.5) and the circular convolution in (8.6).

In [40], DFTs of types 1-4 are defined, with type 1 being the
standard DFT. In the algebraic framework, type 3 is obtained by
choosing $\mathcal{A} = \mathbb{C}[s]/(s^N + 1)$ as algebra with the same basis as before:

$$\mathcal{P}_{b,\alpha} = \left[ W_N^{(k + 1/2) n} \right]_{0 \le k, n < N} = \mathrm{DFT\text{-}3}_N. \qquad (8.22)$$

The DFTs of type 2 and 4 are scaled polynomial transforms [295].

8.2 Algebraic Derivation of the Cooley-Tukey FFT

Knowing the polynomial algebra underlying the DFT enables us to
derive the Cooley-Tukey FFT algebraically. This means that
instead of manipulating the DFT definition, we manipulate the
polynomial algebra $\mathbb{C}[s]/(s^N - 1)$. The basic idea is intuitive. We
showed that the DFT is the matrix representation of the complete
decomposition (8.20). The Cooley-Tukey FFT is now derived by
performing this decomposition in steps as shown in Figure 8.1.
Each step yields a sparse matrix; hence, the $\mathrm{DFT}_N$ is factorized
into a product of sparse matrices, which will be the matrix
representation of the Cooley-Tukey FFT.

[Figure 8.1 diagram. Labels: $\mathbb{C}[s]/P(s)$; Fourier transform; partial
decomposition; $\bigoplus_{0 \le k < N} \mathbb{C}[s]/(s - \alpha_k)$]

Figure 8.1: Basic idea behind the algebraic derivation of
Cooley-Tukey type algorithms



This stepwise decomposition can be formulated generically for
polynomial transforms [292], [294]. Here, we consider only the
DFT.

We first introduce the matrix notation we will use and in
particular the Kronecker product formalism that became mainstream for
FFTs in [388], [381].

Then we derive the radix-2 FFT using a factorization of
$s^N - 1$. Subsequently, we obtain the general-radix FFT using a
decomposition of $s^N - 1$.



8.2.1 Matrix Notation 

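We denote the $N \times N$ identity matrix with $I_N$, and diagonal matrices
with

$$\mathrm{diag}_{0 \le k < N}(\gamma_k) = \begin{bmatrix} \gamma_0 & & \\ & \ddots & \\ & & \gamma_{N-1} \end{bmatrix}. \qquad (8.23)$$

The $N \times N$ stride permutation matrix is defined for $N = KM$ by
the permutation

$$L^N_M : iK + j \mapsto jM + i \qquad (8.24)$$

for $0 \le i < M$, $0 \le j < K$. This definition shows that $L^N_M$
transposes a $K \times M$ matrix stored in row-major order. Alternatively, we
can write

$$L^N_M : i \mapsto iM \bmod (N - 1), \text{ for } 0 \le i < N - 1, \quad N - 1 \mapsto N - 1. \qquad (8.25)$$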



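For example ($\cdot$ means 0),

$$L^6_2 = \begin{bmatrix}
1 & \cdot & \cdot & \cdot & \cdot & \cdot \\
\cdot & \cdot & 1 & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot & 1 & \cdot \\
\cdot & 1 & \cdot & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & 1 & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot & \cdot & 1
\end{bmatrix}. \qquad (8.26)$$

$L^N_{N/2}$ is sometimes called the perfect shuffle.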
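The index form (8.25) translates directly into code. The sketch below assumes the gather convention, in which output entry $i$ is read from input position $iM \bmod (N-1)$; this convention is an assumption of the example (it is the one under which $L^N_{N/2}$ interleaves the two halves of a vector), and the function names are hypothetical.

```python
def stride_permutation(n, m):
    """Index map of L^N_M from (8.25): i -> i*M mod (N-1), with N-1 fixed."""
    if n % m:
        raise ValueError("M must divide N")
    return [(i * m) % (n - 1) for i in range(n - 1)] + [n - 1]

def apply_permutation(perm, x):
    """Gather convention (an assumption of this sketch): out[i] = x[perm[i]]."""
    return [x[p] for p in perm]

# N = 6, M = 3, K = 2: transposes a 2 x 3 matrix stored in row-major order.
matrix_2x3 = ["a", "b", "c",
              "d", "e", "f"]
transposed = apply_permutation(stride_permutation(6, 3), matrix_2x3)

# N = 8, M = 4: the perfect shuffle interleaves the two halves.
shuffled = apply_permutation(stride_permutation(8, 4), list(range(8)))
```

Here `transposed` is `['a', 'd', 'b', 'e', 'c', 'f']` and `shuffled` is `[0, 4, 1, 5, 2, 6, 3, 7]`.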
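Further, we use matrix operators; namely the direct sum

$$A \oplus B = \begin{bmatrix} A & \\ & B \end{bmatrix} \qquad (8.27)$$

and the Kronecker or tensor product

$$A \otimes B = [a_{k,l} B]_{k,l}, \quad \text{for } A = [a_{k,l}]. \qquad (8.28)$$

In particular,

$$I_n \otimes A = A \oplus \cdots \oplus A = \begin{bmatrix} A & & \\ & \ddots & \\ & & A \end{bmatrix} \qquad (8.29)$$

is block-diagonal.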
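The two operators can be written for matrices stored as lists of rows. This is a minimal Python sketch with hypothetical function names; it checks the identity (8.29) for $n = 2$ on a small example matrix.

```python
def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]

def direct_sum(a, b):
    """Block-diagonal matrix with a in the upper left and b in the lower right."""
    cols_a, cols_b = len(a[0]), len(b[0])
    top = [row + [0] * cols_b for row in a]
    bottom = [[0] * cols_a + row for row in b]
    return top + bottom

def kron(a, b):
    """Kronecker product: every entry a[k][l] is replaced by the block a[k][l] * b."""
    return [[a_kl * b_ij for a_kl in row_a for b_ij in row_b]
            for row_a in a for row_b in b]

A = [[1, 2],
     [3, 4]]
block_diagonal = kron(identity(2), A)      # I_2 (x) A
interleaved = kron(A, identity(2))         # A (x) I_2
```

`block_diagonal` equals `direct_sum(A, A)`, while `interleaved` spreads the entries of `A` over a stride of 2, which is the pattern of $\mathrm{DFT}_2 \otimes I_M$ used later.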






We may also construct a larger matrix as a matrix of matrices, e.g.,

$$\begin{bmatrix} A & B \\ B & A \end{bmatrix}. \qquad (8.30)$$

If an algorithm for a transform is given as a product of sparse
matrices built from the constructs above, then an algorithm for the
transpose or inverse of the transform can be readily derived using
mathematical properties including

$$(AB)^T = B^T A^T, \qquad (AB)^{-1} = B^{-1} A^{-1},$$
$$(A \oplus B)^T = A^T \oplus B^T, \qquad (A \oplus B)^{-1} = A^{-1} \oplus B^{-1}, \qquad (8.31)$$
$$(A \otimes B)^T = A^T \otimes B^T, \qquad (A \otimes B)^{-1} = A^{-1} \otimes B^{-1}.$$

Permutation matrices are orthogonal, i.e., $P^T = P^{-1}$. The
transposition or inversion of diagonal matrices is obvious.



8.2.2 Radix-2 FFT 

The DFT decomposes $\mathcal{A} = \mathbb{C}[s]/(s^N - 1)$ with basis
$b = (1, s, \cdots, s^{N-1})$ as shown in (8.20). We assume $N = 2M$. Then

$$s^{2M} - 1 = (s^M - 1)(s^M + 1)$$

factors and we can apply the CRT in the following steps:

$$\mathbb{C}[s]/(s^N - 1) \qquad (8.32)$$
$$\to \mathbb{C}[s]/(s^M - 1) \oplus \mathbb{C}[s]/(s^M + 1) \qquad (8.33)$$
$$\to \bigoplus_{0 \le i < M} \mathbb{C}[s]/(s - W_N^{2i}) \oplus \bigoplus_{0 \le i < M} \mathbb{C}[s]/(s - W_N^{2i+1}) \qquad (8.34)$$
$$\to \bigoplus_{0 \le i < N} \mathbb{C}[s]/(s - W_N^i). \qquad (8.35)$$

As bases in the smaller algebras $\mathbb{C}[s]/(s^M - 1)$ and
$\mathbb{C}[s]/(s^M + 1)$, we choose $c = d = (1, s, \cdots, s^{M-1})$. The
derivation of an algorithm for $\mathrm{DFT}_N$ based on (8.33)-(8.35) is
now completely mechanical by reading off the matrix for each of
the three decomposition steps. The product of these matrices is
equal to the $\mathrm{DFT}_N$.

First, we derive the base change matrix $B$ corresponding to (8.33).
To do so, we have to express the base elements $s^n \in b$ in the basis
$c \cup d$; the coordinate vectors are the columns of $B$. For $0 \le n < M$,
$s^n$ is actually contained in $c$ and $d$, so the first $M$ columns of $B$ are

$$B = \begin{bmatrix} I_M & * \\ I_M & * \end{bmatrix}, \qquad (8.36)$$

where the entries $*$ are determined next. For the base elements
$s^{M+n}$, $0 \le n < M$, we have

$$s^{M+n} \equiv s^n \bmod (s^M - 1), \qquad s^{M+n} \equiv -s^n \bmod (s^M + 1), \qquad (8.37)$$

which yields the final result

$$B = \begin{bmatrix} I_M & I_M \\ I_M & -I_M \end{bmatrix} = \mathrm{DFT}_2 \otimes I_M. \qquad (8.38)$$

Next, we consider step (8.34). $\mathbb{C}[s]/(s^M - 1)$ is decomposed by
$\mathrm{DFT}_M$ and $\mathbb{C}[s]/(s^M + 1)$ by $\mathrm{DFT\text{-}3}_M$ in (8.22).

Finally, the permutation in step (8.35) is the perfect shuffle $L^N_M$,
which interleaves the even and odd spectral components (even and
odd exponents of $W_N$).

The final algorithm obtained is

$$\mathrm{DFT}_{2M} = L^N_M (\mathrm{DFT}_M \oplus \mathrm{DFT\text{-}3}_M)(\mathrm{DFT}_2 \otimes I_M). \qquad (8.39)$$

To obtain a better known form, we use $\mathrm{DFT\text{-}3}_M = \mathrm{DFT}_M D_M$,
with $D_M = \mathrm{diag}_{0 \le i < M}(W_N^i)$, which is evident from (8.22). It yields

$$\mathrm{DFT}_{2M} = L^N_M (\mathrm{DFT}_M \oplus \mathrm{DFT}_M D_M)(\mathrm{DFT}_2 \otimes I_M)
= L^N_M (I_2 \otimes \mathrm{DFT}_M)(I_M \oplus D_M)(\mathrm{DFT}_2 \otimes I_M). \qquad (8.40)$$

The last expression is the radix-2 decimation-in-frequency
Cooley-Tukey FFT. The corresponding decimation-in-time version is
obtained by transposition using (8.31) and the symmetry of the DFT:

$$\mathrm{DFT}_{2M} = (\mathrm{DFT}_2 \otimes I_M)(I_M \oplus D_M)(I_2 \otimes \mathrm{DFT}_M) L^N_2. \qquad (8.41)$$

The entries of the diagonal matrix $I_M \oplus D_M$ are commonly called
twiddle factors.
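Read from right to left, the factors of (8.40) describe a recursive procedure: butterflies between the two halves, twiddle factors on the lower half, two half-size DFTs, and a final interleaving. The following Python sketch spells this out for power-of-two lengths and compares it with a direct evaluation of the DFT sum; the function names and test data are assumptions of the example.

```python
import cmath

def dft_direct(x):
    """Reference DFT: C(k) = sum_m x(m) W_N^(k m), W_N = exp(-2*pi*j/N)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

def fft_dif(x):
    """Radix-2 decimation-in-frequency FFT, applying the factors of (8.40) right to left."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    m = n // 2
    # (DFT_2 (x) I_M): butterflies between the two halves of the input.
    upper = [x[i] + x[i + m] for i in range(m)]
    lower = [x[i] - x[i + m] for i in range(m)]
    # (I_M (+) D_M): twiddle factors W_N^i on the lower half only.
    lower = [v * cmath.exp(-2j * cmath.pi * i / n) for i, v in enumerate(lower)]
    # (I_2 (x) DFT_M): two half-size transforms, computed recursively.
    even, odd = fft_dif(upper), fft_dif(lower)
    # L^N_M: interleave the even- and odd-indexed spectral components.
    out = [0] * n
    out[0::2] = even
    out[1::2] = odd
    return out

signal = [1.0, 2.0, -1.0, 0.5, 0.0, 3.0, -2.0, 1.5]
spectrum_fast = fft_dif(signal)
spectrum_ref = dft_direct(signal)
```

The two spectra agree up to rounding. The recursion stops at length 1, where the DFT is the identity.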

The above method for deriving DFT algorithms is used extensively 
in [268]. 

8.2.3 General-radix FFT 

To algebraically derive the general-radix FFT, we use the
decomposition property of $s^N - 1$. Namely, if $N = KM$ then

$$s^N - 1 = (s^M)^K - 1. \qquad (8.42)$$

Decomposition means that the polynomial is written as the
composition of two polynomials: here, $s^M$ is inserted into $s^K - 1$. Note
that this is a special property: most polynomials do not decompose.

Based on this polynomial decomposition, we obtain the following
stepwise decomposition of $\mathbb{C}[s]/(s^N - 1)$, which is more general
than the previous one in (8.33)-(8.35). The basic idea is to first
decompose with respect to the outer polynomial $t^K - 1$, $t = s^M$,
and then completely [292]:

$$\mathbb{C}[s]/(s^N - 1) = \mathbb{C}[s]/((s^M)^K - 1)
\to \bigoplus_{0 \le i < K} \mathbb{C}[s]/(s^M - W_K^i) \qquad (8.43)$$
$$\to \bigoplus_{0 \le i < K} \bigoplus_{0 \le j < M} \mathbb{C}[s]/(s - W_N^{jK+i}) \qquad (8.44)$$
$$\to \bigoplus_{0 \le i < N} \mathbb{C}[s]/(s - W_N^i). \qquad (8.45)$$

As bases in the smaller algebras $\mathbb{C}[s]/(s^M - W_K^i)$ we choose
$c_i = (1, s, \cdots, s^{M-1})$. As before, the derivation is completely
mechanical from here: only the three matrices corresponding to
(8.43)-(8.45) have to be read off.

The first decomposition step requires us to compute
$s^n \bmod (s^M - W_K^i)$, $0 \le n < N$. To do so, we decompose
the index $n$ as $n = lM + m$ and compute

$$s^n = s^{lM+m} = (s^M)^l s^m \equiv W_K^{il} s^m \bmod (s^M - W_K^i). \qquad (8.46)$$

This shows that the matrix for (8.43) is given by $\mathrm{DFT}_K \otimes I_M$.

In step (8.44), each $\mathbb{C}[s]/(s^M - W_K^i)$ is completely decomposed
by its polynomial transform

$$\mathrm{DFT}_M(i, K) = \mathrm{DFT}_M \cdot \mathrm{diag}_{0 \le j < M}(W_N^{ij}). \qquad (8.47)$$

At this point, $\mathbb{C}[s]/(s^N - 1)$ is completely decomposed, but the
spectrum is ordered according to $jK + i$, $0 \le i < K$, $0 \le j < M$ ($j$
runs faster). The desired order is $iM + j$.

Thus, in step (8.45), we need to apply the permutation $jK + i \mapsto
iM + j$, which is exactly the stride permutation $L^N_M$ in (8.24).

In summary, we obtain the Cooley-Tukey decimation-in-frequency
FFT with arbitrary radix:

$$\mathrm{DFT}_N = L^N_M \Big( \bigoplus_{0 \le i < K} \mathrm{DFT}_M \cdot \mathrm{diag}_{0 \le j < M}(W_N^{ij}) \Big) (\mathrm{DFT}_K \otimes I_M)
= L^N_M (I_K \otimes \mathrm{DFT}_M) \, T^N_M \, (\mathrm{DFT}_K \otimes I_M). \qquad (8.48)$$

The matrix $T^N_M$ is diagonal and usually called the twiddle matrix.
Transposition using (8.31) yields the corresponding
decimation-in-time version:

$$(\mathrm{DFT}_K \otimes I_M) \, T^N_M \, (I_K \otimes \mathrm{DFT}_M) \, L^N_K. \qquad (8.49)$$
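The factorization (8.48) can be verified numerically by building each factor as a dense matrix and multiplying them out. The Python sketch below does this with plain lists; it assumes the gather convention for $L^N_M$ (row $i$ has its 1 in column $iM \bmod (N-1)$) and the twiddle diagonal $W_N^{ij}$ ordered block by block, as in (8.47). All function names are hypothetical, and dense products are used only for checking, not as an efficient FFT.

```python
import cmath

def w(n, e):
    """W_n^e with W_n = exp(-2*pi*j/n)."""
    return cmath.exp(-2j * cmath.pi * e / n)

def dft(n):
    return [[w(n, k * m) for m in range(n)] for k in range(n)]

def identity(n):
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]

def kron(a, b):
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def stride_matrix(n, m):
    """L^N_M with row i selecting input position i*M mod (N-1) (gather convention)."""
    perm = [(i * m) % (n - 1) for i in range(n - 1)] + [n - 1]
    return [[1 if c == perm[r] else 0 for c in range(n)] for r in range(n)]

def twiddle_matrix(k_size, m_size):
    """T^N_M: diagonal entries W_N^(i*j), block i = 0..K-1, position j = 0..M-1."""
    n = k_size * m_size
    diag = [w(n, i * j) for i in range(k_size) for j in range(m_size)]
    return [[diag[r] if r == c else 0 for c in range(n)] for r in range(n)]

def cooley_tukey_dif(k_size, m_size):
    """Right-hand side of (8.48), multiplied out."""
    n = k_size * m_size
    product = kron(dft(k_size), identity(m_size))
    product = matmul(twiddle_matrix(k_size, m_size), product)
    product = matmul(kron(identity(k_size), dft(m_size)), product)
    return matmul(stride_matrix(n, m_size), product)

def max_error(k_size, m_size):
    """Largest entrywise deviation from the DFT matrix of size K*M."""
    n = k_size * m_size
    factored, reference = cooley_tukey_dif(k_size, m_size), dft(n)
    return max(abs(factored[r][c] - reference[r][c])
               for r in range(n) for c in range(n))
```

For example, `max_error(2, 3)` and `max_error(3, 2)` are both at the level of rounding error, which confirms the factorization for $N = 6$ with either choice of radix.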



8.3 Discussion and Further Reading 

This chapter only scratches the surface of the connection between 
algebra and the DFT or signal processing in general. We provide a 
few references for further reading. 

8.3.1 Algebraic Derivation of Transform Algorithms 

As mentioned before, the use of polynomial algebras and the CRT
underlies much of the early work on FFTs and convolution
algorithms [409], [268], [12]. For example, Winograd's work on FFTs
minimizes the number of non-rational multiplications. This and
his work on complexity theory in general make heavy use of
polynomial algebras [409], [414], [417] (see Winograd's Short
DFT Algorithms (Chapter 7) for more information and references).
See [72] for a broad treatment of algebraic complexity theory.

Since $\mathbb{C}[s]/(s^N - 1) = \mathbb{C}[C_N]$ can be viewed as a group algebra for
the cyclic group, the methods shown in this chapter can be
translated into the context of group representation theory. For
example, [256] derives the general-radix FFT using group theory and
already uses the Kronecker product formalism. So does Beth,
who started the area of FFTs for more general groups [23], [231].
However, Fourier transforms for groups have found only sporadic
applications [317]. Along a related line of work, [117] shows that
using group theory it is possible to discover and generate
certain algorithms for trigonometric transforms, such as discrete
cosine transforms (DCTs), automatically using a computer program.

More recently, the polynomial algebra framework was extended 
to include most trigonometric transforms used in signal process- 
ing [293], [295], namely, besides the DFT, the discrete cosine and 
sine transforms and various real DFTs including the discrete Hart- 
ley transform. It turns out that the same techniques shown in this 
chapter can then be applied to derive, explain, and classify most 
of the known algorithms for these transforms and even obtain a 
large class of new algorithms including general-radix algorithms 
for the discrete cosine and sine transforms (DCTs/DSTs) [292], 
[294], [398], [397]. 

This latter line of work is part of the algebraic signal processing 
theory briefly discussed next. 

8.3.2 Algebraic Signal Processing Theory 

The algebraic properties of transforms used in the above work on
algorithm derivation hint at a connection between algebra and
(linear) signal processing itself. This is indeed the case and was
fully developed in a recent body of work called algebraic signal
processing theory (ASP). The foundation of ASP is developed in
[295], [293], [291].






ASP first identifies the algebraic structure of (linear) signal pro-
cessing: the common assumptions on available operations for fil-
ters and signals make the set of filters an algebra $\mathcal{A}$ and the set
of signals an associated $\mathcal{A}$-module $\mathcal{M}$. ASP then builds a signal
processing theory formally from the axiomatic definition of a sig-
nal model: a triple $(\mathcal{A}, \mathcal{M}, \Phi)$, where $\Phi$ generalizes the idea of
the z-transform to mappings from vector spaces of signal values to
$\mathcal{M}$. If a signal model is given, other concepts, such as spectrum,
Fourier transform, frequency response are automatically defined
but take different forms for different models. For example, infi-
nite and finite time as discussed in Table 8.1 are two examples of
signal models. Their complete definition is provided in Table 8.2
and identifies the proper notion of a finite z-transform as a mapping
$\mathbb{C}^n \to \mathbb{C}[s]/(s^n - 1)$.



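Signal model     Infinite time                                                                                              Finite time

$\mathcal{A}$    $\{\sum_{n \in \mathbb{Z}} H(n) s^n \mid (\cdots, H(-1), H(0), H(1), \cdots) \in \ell^1(\mathbb{Z})\}$    $\mathbb{C}[s]/(s^n - 1)$

$\mathcal{M}$    $\{\sum_{n \in \mathbb{Z}} X(n) s^n \mid (\cdots, X(-1), X(0), X(1), \cdots) \in \ell^2(\mathbb{Z})\}$    $\mathbb{C}[s]/(s^n - 1)$

$\Phi$           $\Phi: \ell^2(\mathbb{Z}) \to \mathcal{M}$                                                                 $\Phi: \mathbb{C}^n \to \mathcal{M}$
                 defined in (8.1)                                                                                           defined in (8.4)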



Table 8.2: Infinite and finite time models as defined in ASP. 

ASP shows that many signal models are in principle possible, each
with its own notion of filtering and Fourier transform. Those that
support shift-invariance have commutative algebras. Since finite-
dimensional commutative algebras are precisely polynomial alge-
bras, their appearance in signal processing is explained. For exam-
ple, ASP identifies the polynomial algebras underlying the DCTs
and DSTs, which hence become Fourier transforms in the ASP
sense. The signal models are called finite space models since they
support signal processing based on an undirected shift operator,
different from the directed time shift. Many more insights are
provided by ASP, including the need for and the choice of
boundary conditions, properties of transforms, techniques for de-
riving new signal models, and the concise derivation of algorithms
mentioned before.


Chapter 9

The Cooley-Tukey Fast
Fourier Transform
Algorithm



The publication by Cooley and Tukey [90] in 1965 of an efficient 
algorithm for the calculation of the DFT was a major turning point 
in the development of digital signal processing. During the five or 
so years that followed, various extensions and modifications were 
made to the original algorithm [95]. By the early 1970's the prac- 
tical programs were basically in the form used today. The standard 
development presented in [274], [299], [38] shows how the DFT of 
a length-N sequence can be simply calculated from the two length- 
N/2 DFT's of the even index terms and the odd index terms. This 
is then applied to the two half-length DFT's to give four quarter- 
length DFT's, and repeated until N scalars are left which are the 
DFT values. Because of alternately taking the even and odd index
terms, two forms of the resulting programs are called decimation-
in-time and decimation-in-frequency. For a length of $2^M$, the di-
viding process is repeated $M = \log_2 N$ times and requires N multi-
plications each time. This gives the famous formula for the com-
putational complexity of the FFT of $N \log_2 N$, which was derived in
Multidimensional Index Mapping: Equation 34 (3.34).
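
The even/odd splitting described above can be sketched in a few lines of Python. This is only an illustrative recursive form, not one of the programs referenced in this book; the function names `dft_direct` and `fft_even_odd` are hypothetical, and the length is assumed to be a power of two.

```python
import cmath

def dft_direct(x):
    """Direct O(N^2) DFT, used here only as a reference."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]

def fft_even_odd(x):
    """Length-N DFT built from the two length-N/2 DFTs of the
    even-indexed and odd-indexed terms (N assumed a power of two)."""
    n_pts = len(x)
    if n_pts == 1:
        return list(x)                 # a length-1 DFT is the sample itself
    even = fft_even_odd(x[0::2])       # DFT of even-indexed terms
    odd = fft_even_odd(x[1::2])        # DFT of odd-indexed terms
    half = n_pts // 2
    out = [0j] * n_pts
    for k in range(half):
        # twiddle factor W_N^k applied to the odd-term DFT
        t = cmath.exp(-2j * cmath.pi * k / n_pts) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out
```

Each level of the recursion halves the length, so there are $\log_2 N$ levels, which is where the $N \log_2 N$ operation count comes from.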

Although the decimation methods are straightforward and easy to 
understand, they do not generalize well. For that reason it will be 
assumed that the reader is familiar with that description and this 
chapter will develop the FFT using the index map from Multidi- 
mensional Index Mapping (Chapter 3). 

The Cooley-Tukey FFT always uses the Type 2 index map from
Multidimensional Index Mapping: Equation 11 (3.11). This is
necessary for the most popular forms that have $N = R^M$, but is
also used even when the factors are relatively prime and a Type 1
map could be used. The time and frequency maps from Multidi-
mensional Index Mapping: Equation 6 (3.6) and Multidimensional
Index Mapping: Equation 12 (3.12) are

$$n = ((K_1 n_1 + K_2 n_2))_N \tag{9.1}$$

$$k = ((K_3 k_1 + K_4 k_2))_N \tag{9.2}$$

Type-2 conditions Multidimensional Index Mapping: Equation 8 
(3.8) and Multidimensional Index Mapping: Equation 11 (3.11) 
become 



$$K_1 = aN_2 \text{ or } K_2 = bN_1 \text{ but not both} \tag{9.3}$$

and

$$K_3 = cN_2 \text{ or } K_4 = dN_1 \text{ but not both} \tag{9.4}$$



The row and column calculations in Multidimensional Index Map- 
ping: Equation 15 (3.15) are uncoupled by Multidimensional Index 
Mapping: Equation 16 (3.16) which for this case are 

$$((K_1 K_4))_N = 0 \text{ or } ((K_2 K_3))_N = 0 \text{ but not both} \tag{9.5}$$

To make each short sum a DFT, the $K_i$ must satisfy

$$((K_1 K_3))_N = N_2 \text{ and } ((K_2 K_4))_N = N_1 \tag{9.6}$$

In order to have the smallest values for $K_i$, the constants in (9.3)
are chosen to be

$$a = d = K_2 = K_3 = 1 \tag{9.7}$$

which makes the index maps of (9.1) become

$$n = N_2 n_1 + n_2 \tag{9.8}$$

$$k = k_1 + N_1 k_2 \tag{9.9}$$

These index maps are all evaluated modulo N, but in (9.8), explicit 
reduction is not necessary since n never exceeds N. The reduction 
notation will be omitted for clarity. From Multidimensional In- 
dex Mapping: Equation 15 (3.15) and example Multidimensional 
Index Mapping: Equation 19 (3.19), the DFT is 

$$X = \sum_{n_2=0}^{N_2-1} \sum_{n_1=0}^{N_1-1} x \, W_{N_1}^{n_1 k_1} \, W_N^{n_2 k_1} \, W_{N_2}^{n_2 k_2} \tag{9.10}$$

This map of (9.8) and the form of the DFT in (9.10) are the fun-
damentals of the Cooley-Tukey FFT.

The order of the summations using the Type 2 map in (9.10) cannot
be reversed as it can with the Type-1 map. This is because of the
$W_N$ terms, the twiddle factors.
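
As a check on the index map, the nested sums of (9.10) can be evaluated literally and compared with the direct DFT. The following Python sketch does this for arbitrary factors $N_1$ and $N_2$; the names `dft_index_map`, `dft_direct` and `w` are hypothetical, and no attempt is made at efficiency.

```python
import cmath

def w(n_pts, e):
    """W_N^e with W_N = exp(-j 2 pi / N)."""
    return cmath.exp(-2j * cmath.pi * e / n_pts)

def dft_direct(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n_pts = len(x)
    return [sum(x[n] * w(n_pts, n * k) for n in range(n_pts))
            for k in range(n_pts)]

def dft_index_map(x, n1_len, n2_len):
    """Evaluate (9.10) using n = N2*n1 + n2 and k = k1 + N1*k2."""
    n_pts = n1_len * n2_len
    out = [0j] * n_pts
    for k1 in range(n1_len):
        for k2 in range(n2_len):
            acc = 0j
            for n2 in range(n2_len):
                # inner sum over n1: a length-N1 DFT for this n2
                row = sum(x[n2_len * n1 + n2] * w(n1_len, n1 * k1)
                          for n1 in range(n1_len))
                # twiddle factor W_N^(n2 k1), then the length-N2 DFT term
                acc += row * w(n_pts, n2 * k1) * w(n2_len, n2 * k2)
            out[k1 + n1_len * k2] = acc
    return out
```

For example, `dft_index_map(x, 3, 4)` and `dft_index_map(x, 2, 6)` should both agree with `dft_direct(x)` for a length-12 sequence, to within round-off.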

Turning (9.10) into an efficient program requires some care. From
Multidimensional Index Mapping: Efficiencies Resulting from In-
dex Mapping with the DFT (Section 3.3: Efficiencies Resulting
from Index Mapping with the DFT) we know that all the factors
should be equal. If $N = R^M$, with R called the radix, $N_1$ is first
set equal to R and $N_2$ is then necessarily $R^{M-1}$. Consider $n_1$ to
be the index along the rows and $n_2$ along the columns. The inner
sum of (9.10) over $n_1$ represents a length-$N_1$ DFT for each value
of $n_2$. These $N_2$ length-$N_1$ DFT's are the DFT's of the rows of the
$x(n_1, n_2)$ array. The resulting array of row DFT's is multiplied by
an array of twiddle factors which are the $W_N$ terms in (9.10). The
twiddle-factor array for a length-8 radix-2 FFT is



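$$TF: \; W_8^{n_2 k_1} = \begin{bmatrix} W^0 & W^0 \\ W^0 & W^1 \\ W^0 & W^2 \\ W^0 & W^3 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & W \\ 1 & -j \\ 1 & -jW \end{bmatrix} \tag{9.11}$$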



The twiddle factor array will always have unity in the first row and 
first column. 
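
The array in (9.11) can be generated numerically as a small illustration. In this Python sketch the function name `twiddle_array` is hypothetical; rows are indexed by $n_2$ and columns by $k_1$, as in (9.11).

```python
import cmath

def twiddle_array(n1_len, n2_len):
    """Twiddle factors W_N^(n2*k1), N = N1*N2, as an N2-by-N1 array."""
    n_pts = n1_len * n2_len
    return [[cmath.exp(-2j * cmath.pi * n2 * k1 / n_pts)
             for k1 in range(n1_len)]
            for n2 in range(n2_len)]

# Length-8 radix-2 case: N1 = 2, N2 = 4
tf = twiddle_array(2, 4)
```

Since either $n_2 = 0$ or $k_1 = 0$ makes the exponent zero, the first row and first column are unity for any choice of factors.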

To complete (9.10) at this point, after the row DFT's are multiplied
by the TF array, the $N_1$ length-$N_2$ DFT's of the columns are calcu-
lated. However, since the column DFT's are of length $R^{M-1}$, they
can be posed as an $R^{M-2}$ by R array and the process repeated, again
using length-R DFT's. After M stages of length-R DFT's with TF
multiplications interleaved, the DFT is complete. The flow graph
of a length-2 DFT is given in Figure 9.1 and is called a butter-
fly because of its shape. The flow graph of the complete length-8
radix-2 FFT is shown in Figure 9.2.



Figure 9.1: A Radix-2 Butterfly, computing X(0) = x(0) + x(1) and
X(1) = x(0) - x(1)

Figure 9.2: Length-8 Radix-2 FFT Flow Graph

This flow-graph, the twiddle factor map of (9.11), and the basic
equation (9.10) should be completely understood before going fur-
ther.

A very efficient indexing scheme has evolved over the years that 
results in a compact and efficient computer program. A FORTRAN 
program is given below that implements the radix-2 FFT. It should 
be studied [64] to see how it implements (9.10) and the flow-graph 
representation. 
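
C     Radix-2 decimation-in-frequency FFT, computed in place.
C     X and Y hold the real and imaginary parts; N = 2**M.
      N2 = N
      DO 10 K = 1, M
         N1 = N2
         N2 = N2/2
         E  = 6.28318/N1
         A  = 0
         DO 20 J = 1, N2
C           Twiddle factor C + jS for this group of butterflies
            C = COS (A)
            S =-SIN (A)
            A = J*E
            DO 30 I = J, N, N1
               L = I + N2
C              Length-2 butterfly followed by the TF multiplication
               XT   = X(I) - X(L)
               X(I) = X(I) + X(L)
               YT   = Y(I) - Y(L)
               Y(I) = Y(I) + Y(L)
               X(L) = XT*C - YT*S
               Y(L) = XT*S + YT*C
   30       CONTINUE
   20    CONTINUE
   10 CONTINUE
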

Listing 9.1: A Radix-2 Cooley-Tukey FFT Program 
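
For readers who want to experiment, the following is a rough Python transcription of the loop structure of Listing 9.1, followed by a separate bit-reversal unscrambler. It is a sketch under the assumption that the length is a power of two; the function names are hypothetical, indices are zero-based, and the twiddle angle is computed directly as `j * e` rather than carried in a running variable.

```python
import math

def fft_radix2_dif(x, y):
    """In-place radix-2 decimation-in-frequency FFT.
    x, y: lists holding the real and imaginary parts (length 2**m).
    The output is left in bit-reversed order."""
    n = len(x)
    m = n.bit_length() - 1
    if n < 1 or n != 1 << m:
        raise ValueError("length must be a power of two")
    n2 = n
    for _ in range(m):                    # one pass per stage
        n1 = n2
        n2 //= 2
        e = 2.0 * math.pi / n1
        for j in range(n2):               # one twiddle factor per j
            c = math.cos(j * e)
            s = -math.sin(j * e)
            for i in range(j, n, n1):     # all butterflies sharing it
                l = i + n2
                xt = x[i] - x[l]
                x[i] += x[l]
                yt = y[i] - y[l]
                y[i] += y[l]
                x[l] = xt * c - yt * s
                y[l] = xt * s + yt * c

def bit_reverse_permute(x, y):
    """Unscramble bit-reversed order in place."""
    n = len(x)
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:                    # add one to j in reversed-bit order
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
            y[i], y[j] = y[j], y[i]
```

Calling `fft_radix2_dif(x, y)` and then `bit_reverse_permute(x, y)` should leave the DFT in natural order.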



This discussion, the flow graph of Figure 9.2, and the program of
Listing 9.1 are all based on the input index map of Multidimensional
Index Mapping: Equation 6 (3.6) and (9.1), and the calculations are
performed in-place.
According to Multidimensional Index Mapping: In-Place Calcu-
lation of the DFT and Scrambling (Section 3.2: In-Place Calcula-
tion of the DFT and Scrambling), this means the output is scram-
bled in bit-reversed order and should be followed by an unscram-
bler to give the DFT in proper order. This formulation is called
a decimation-in-frequency FFT [274], [299], [38]. A very similar
algorithm based on the output index map can be derived which is
called a decimation-in-time FFT. Examples of FFT programs are
found in [64] and in the Appendix of this book.

9.1 Modifications to the Basic Cooley-Tukey 
FFT 

Soon after the paper by Cooley and Tukey, there were improve- 
ments and extensions made. One very important discovery was the 
improvement in efficiency by using a larger radix of 4, 8 or even 16. 
For example, just as for the radix-2 butterfly, there are no multipli- 
cations required for a length-4 DFT, and therefore, a radix-4 FFT 
would have only twiddle factor multiplications. Because there are 
half as many stages in a radix-4 FFT, there would be half as many 
multiplications as in a radix-2 FFT. In practice, because some of 
the multiplications are by unity, the improvement is not by a fac- 
tor of two, but it is significant. A radix-4 FFT is easily developed 
from the basic radix-2 structure by replacing the length-2 butter- 
fly by a length-4 butterfly and making a few other modifications. 
Programs can be found in [64] and operation counts will be given 
in "Evaluation of the Cooley-Tukey FFT Algorithms" (Section 9.3: 
Evaluation of the Cooley-Tukey FFT Algorithms). 

Increasing the radix to 8 gives some improvement but not as much 
as from 2 to 4. Increasing it to 16 is theoretically promising but the 
small decrease in multiplications is somewhat offset by an increase 
in additions and the program becomes rather long. Other radices
are not attractive because they generally require a substantial num-
ber of multiplications and additions in the butterflies.

The second method of reducing arithmetic is to remove the un- 
necessary TF multiplications by plus or minus unity or by plus or 
minus the square root of minus one. This occurs when the expo-
nent of $W_N$ is zero or a multiple of N/4. A reduction of additions as
well as multiplications is achieved by removing these extraneous 
complex multiplications since a complex multiplication requires at 
least two real additions. In a program, this reduction is usually 
achieved by having special butterflies for the cases where the TF is 
one or j. As many as four special butterflies may be necessary to 
remove all unnecessary arithmetic, but in many cases there will be 
no practical improvement above two or three. 

In addition to removing multiplications by one or j, there can 
be a reduction in multiplications by using a special butterfly for 
TFs with $W_N^{N/8}$, which have equal real and imaginary parts. Also,
for computers or hardware with multiplication considerably slower 
than addition, it is desirable to use an algorithm for complex mul- 
tiplication that requires three multiplications and three additions 
rather than the conventional four multiplications and two additions. 
Note that this gives no reduction in the total number of arithmetic 
operations, but does give a trade of multiplications for additions. 
This is one reason not to use complex data types in programs but 
to explicitly program complex arithmetic. 
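
One common form of the three-multiply complex product is sketched below in Python. It assumes the twiddle factor $c + jd$ is known in advance, so that $d - c$ and $c + d$ can be precomputed; only then does the count come to three multiplications and three additions per product. The function names are hypothetical.

```python
def precompute_twiddle(c, d):
    """Store c, d - c and c + d for the twiddle factor c + jd."""
    return (c, d - c, c + d)

def complex_mul_3m(a, b, tw):
    """(a + jb)(c + jd) with three real multiplications and
    three real additions, given the precomputed twiddle tuple."""
    c, d_minus_c, c_plus_d = tw
    k1 = c * (a + b)          # addition 1, multiplication 1
    k2 = a * d_minus_c        # multiplication 2
    k3 = b * c_plus_d         # multiplication 3
    return k1 - k3, k1 + k2   # additions 2 and 3: (ac - bd, ad + bc)
```

Because the real and imaginary parts are handled as separate real variables, this kind of rearrangement is not available when a language's built-in complex type is used.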

A time-consuming and unnecessary part of the execution of an FFT
program is the calculation of the sine and cosine terms which are
the real and imaginary parts of the TFs. There are basically three
approaches to obtaining the sine and cosine values. They can be
calculated as needed, which is what is done in the sample program
above. One value per stage can be calculated and the others recur-
sively calculated from those. That method is fast but suffers from
accumulated round-off errors. The fastest method is to fetch pre-
calculated values from a stored table. This has the disadvantage of
requiring considerable memory space.
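
The difference between direct evaluation and the recursive approach can be seen in a short Python sketch. The function names are hypothetical; `twiddles_direct` stands in for both the "calculate as needed" and the "stored table" approaches, since a table would simply hold these values.

```python
import cmath

def twiddles_direct(n_pts):
    """Evaluate each W_N^k, k = 0 .. N/2 - 1, from the exponential."""
    return [cmath.exp(-2j * cmath.pi * k / n_pts)
            for k in range(n_pts // 2)]

def twiddles_recursive(n_pts):
    """Generate W_N^k by repeated multiplication by W_N^1.
    Only one exponential is evaluated, but round-off accumulates."""
    step = cmath.exp(-2j * cmath.pi / n_pts)
    w = 1 + 0j
    out = []
    for _ in range(n_pts // 2):
        out.append(w)
        w *= step
    return out

# Largest deviation between the two methods for N = 1024
err = max(abs(a - b) for a, b in
          zip(twiddles_direct(1024), twiddles_recursive(1024)))
```

The deviation `err` is small in double precision but generally grows with the number of recursive steps, which is the round-off accumulation mentioned above.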

If all the N DFT values are not needed, special forms of the FFT 
can be developed using a process called pruning [226] which re- 
moves the operations concerned with the unneeded outputs. 

Special algorithms are possible for cases with real data or with 
symmetric data [82]. The decimation-in-time algorithm can be 
easily modified to transform real data and save half the arithmetic 
required for complex data [357]. There are numerous other mod- 
ifications to deal with special hardware considerations such as an 
array processor or a special microprocessor such as the Texas In- 
struments TMS320. Examples of programs that deal with some of 
these items can be found in [299], [64], [82]. 

9.2 The Split-Radix FFT Algorithm 

Recently several papers [228], [106], [393], [350], [102] have been
published on algorithms to calculate a length-$2^M$ DFT more ef-
ficiently than a Cooley-Tukey FFT of any radix. They all have
the same computational complexity and are optimal for lengths up
through 16, and until recently they were thought to give the best total
add-multiply count possible for any power-of-two length. Yavne
published an algorithm with the same computational complexity
in 1968 [421], but it went largely unnoticed. Johnson and Frigo
have recently reported the first improvement in almost 40 years
[201]. The reduction in total operations is only a few percent, but
it is a reduction.

The basic idea behind the split-radix FFT (SRFFT) as derived by
Duhamel and Hollmann [106], [102] is the application of a radix-2
index map to the even-indexed terms and a radix-4 map to the odd-
indexed terms. The basic definition of the DFT

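$$C_k = \sum_{n=0}^{N-1} x_n W^{nk} \tag{9.12}$$

with $W = e^{-j2\pi/N}$ gives

$$C_{2k} = \sum_{n=0}^{N/2-1} \left[ x_n + x_{n+N/2} \right] W^{2nk} \tag{9.13}$$

for the even index terms, and

$$C_{4k+1} = \sum_{n=0}^{N/4-1} \left[ (x_n - x_{n+N/2}) - j (x_{n+N/4} - x_{n+3N/4}) \right] W^{n} W^{4nk} \tag{9.14}$$

and

$$C_{4k+3} = \sum_{n=0}^{N/4-1} \left[ (x_n - x_{n+N/2}) + j (x_{n+N/4} - x_{n+3N/4}) \right] W^{3n} W^{4nk} \tag{9.15}$$
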

for the odd index terms. This results in an L-shaped "butterfly"
shown in Figure 9.3 which relates a length-N DFT to one length-
N/2 DFT and two length-N/4 DFT's with twiddle factors. Re-
peating this process for the half and quarter length DFT's until
scalars result gives the SRFFT algorithm in much the same way
the decimation-in-frequency radix-2 Cooley-Tukey FFT is derived
[274], [299], [38]. The resulting flow graph for the algorithm cal-
culated in place looks like a radix-2 FFT except for the location
of the twiddle factors. Indeed, it is the location of the twiddle fac-
tors that makes this algorithm use less arithmetic. The L-shaped
SRFFT butterfly Figure 9.3 advances the calculation of the top half
by one of the M stages while the lower half, like a radix-4 butter-
fly, calculates two stages at once. This is illustrated for N = 8 in
Figure 9.4.

Figure 9.3: SRFFT Butterfly

Figure 9.4: Length-8 SRFFT
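
The decomposition into one half-length and two quarter-length DFT's can be written directly as a recursion. The Python sketch below follows the even-index and the two odd-index sums term by term; it is meant only to illustrate the structure, it is not the in-place program discussed in this section, and the name `split_radix_dif` is hypothetical. The recursion places outputs in natural order rather than bit-reversed order.

```python
import cmath

def split_radix_dif(x):
    """Recursive split-radix DFT for a power-of-two length."""
    n_pts = len(x)
    if n_pts == 1:
        return list(x)
    if n_pts == 2:
        return [x[0] + x[1], x[0] - x[1]]
    half, quarter = n_pts // 2, n_pts // 4
    # even outputs C_{2k}: length-N/2 DFT of x_n + x_{n+N/2}
    a = [x[n] + x[n + half] for n in range(half)]
    b, c = [], []
    for n in range(quarter):
        d0 = x[n] - x[n + half]
        d1 = x[n + quarter] - x[n + 3 * quarter]
        w1 = cmath.exp(-2j * cmath.pi * n / n_pts)      # W^n
        w3 = cmath.exp(-2j * cmath.pi * 3 * n / n_pts)  # W^{3n}
        b.append((d0 - 1j * d1) * w1)   # input to the C_{4k+1} DFT
        c.append((d0 + 1j * d1) * w3)   # input to the C_{4k+3} DFT
    out = [0j] * n_pts
    out[0::2] = split_radix_dif(a)
    out[1::4] = split_radix_dif(b)
    out[3::4] = split_radix_dif(c)
    return out
```

Note that the half-length branch carries no twiddle factors at all, which is the source of the arithmetic savings relative to a radix-2 decomposition.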



Unlike the fixed radix, mixed radix or variable radix Cooley-Tukey 
FFT or even the prime factor algorithm or Winograd Fourier trans-
form algorithm, the Split-Radix FFT does not progress completely
stage by stage, or, in terms of indices, does not complete each 
nested sum in order. This is perhaps better seen from the polyno- 
mial formulation of Martens [228]. Because of this, the indexing is 
somewhat more complicated than the conventional Cooley-Tukey 
program. 

A FORTRAN program is given below which implements the basic 
decimation-in-frequency split-radix FFT algorithm. The indexing 
scheme [350] of this program gives a structure very similar to the 
Cooley-Tukey programs in [64] and allows the same modifications 
and improvements such as decimation-in-time, multiple butterflies,
table look-up of sine and cosine values, three real per complex
multiply methods, and real data versions [102], [357].

      SUBROUTINE FFT(X,Y,N,M)
      N2 = 2*N
      DO 10 K = 1, M-1
         N2 = N2/2
         N4 = N2/4
         E  = 6.283185307179586/N2
         A  = 0
         DO 20 J = 1, N4
            A3  = 3*A
            CC1 = COS(A)
            SS1 = SIN(A)
            CC3 = COS(A3)
            SS3 = SIN(A3)
            A   = J*E
            IS  = J
            ID  = 2*N2
   40       DO 30 I0 = IS, N-1, ID
               I1 = I0 + N4
               I2 = I1 + N4
               I3 = I2 + N4
               R1    = X(I0) - X(I2)
               X(I0) = X(I0) + X(I2)
               R2    = X(I1) - X(I3)
               X(I1) = X(I1) + X(I3)
               S1    = Y(I0) - Y(I2)
               Y(I0) = Y(I0) + Y(I2)
               S2    = Y(I1) - Y(I3)
               Y(I1) = Y(I1) + Y(I3)
               S3    = R1 - S2
               R1    = R1 + S2
               S2    = R2 - S1
               R2    = R2 + S1
               X(I2) = R1*CC1 - S2*SS1
               Y(I2) =-S2*CC1 - R1*SS1
               X(I3) = S3*CC3 + R2*SS3
               Y(I3) = R2*CC3 - S3*SS3
   30       CONTINUE

As was done for the other decimation-in-frequency algorithms, the
input index map is used and the calculations are done in place, re-
sulting in the output being in bit-reversed order. It is the three state-
ments following label 30 that do the special indexing required by
the SRFFT. The last stage is length-2 and, therefore, inappropriate
for the standard L-shaped butterfly, so it is calculated separately in
the DO 60 loop. This program is considered a one-butterfly ver-
sion. A second butterfly can be added just before statement 40 to
remove the unnecessary multiplications by unity. A third butter-
fly can be added to reduce the number of real multiplications from
four to two for the complex multiplication when W has equal real
and imaginary parts. It is also possible to reduce the arithmetic for
the two-butterfly case and to reduce the data transfers by directly
programming a length-4 and length-8 butterfly to replace the last
three stages. This is called a two-butterfly-plus version. Operation
counts for the one, two, two-plus and three butterfly SRFFT pro-
grams are given in the next section. Some details can be found in
[350].

The special case of a SRFFT for real data and symmetric data 
is discussed in [102]. An application of the decimation-in-time 
SRFFT to real data is given in [357]. Application to convolution 
is made in [110], to the discrete Hartley transform in [352], [110],
to calculating the discrete cosine transform in [393], and could be 
made to calculating number theoretic transforms. 

An improvement in operation count has been reported by Johnson 
and Frigo [201] which involves a scaling of multiplying factors. 
The improvement is small but until this result, it was generally 
thought the Split-Radix FFT was optimal for total floating point 
operation count. 

9.3 Evaluation of the Cooley-Tukey FFT
Algorithms

The evaluation of any FFT algorithm starts with a count of the real 
(or floating point) arithmetic. Table 9.1 gives the number of real 
multiplications and additions required to calculate a length-N FFT 
of complex data. Results of programs with one, two, three and five 
butterflies are given to show the improvement that can be expected 
from removing unnecessary multiplications and additions. Results 
of radices two, four, eight and sixteen for the Cooley-Tukey FFT 
as well as of the split-radix FFT are given to show the relative 
merits of the various structures. Comparisons of these data should 
be made with the table of counts for the PFA and WFTA programs 
in The Prime Factor and Winograd Fourier Transform Algorithms: 
Evaluation of the PFA and WFTA (Section 10.4: Evaluation of 
the PFA and WFTA). All programs use the four-multiply-two-add 
complex multiply algorithm. A similar table can be developed for 
the three-multiply-three-add algorithm, but the relative results are 
the same. 

From the table it is seen that a greater improvement is obtained go- 
ing from radix-2 to 4 than from 4 to 8 or 16. This is partly because 
length 2 and 4 butterflies have no multiplications while length 8, 
16 and higher do. It is also seen that going from one to two but- 
terflies gives more improvement than going from two to higher 
values. From an operation count point of view and from practical 
experience, a three butterfly radix-4 or a two butterfly radix-8 FFT 
is a good compromise. The radix-8 and 16 programs become long, 
especially with multiple butterflies, and they give a limited choice 
of transform length unless combined with some length 2 and 4 but- 
terflies. 

     N      M1      M2      M3      M5      A1      A2      A3      A5

Radix-2
     2       4       0       0       0       6       4       4       4
     4      16       4       0       0      24      18      16      16
     8      48      20       8       4      72      58      52      52
    16     128      68      40      28     192     162     148     148
    32     320     196     136     108     480     418     388     388
    64     768     516     392     332    1152    1026     964     964
   128    1792    1284    1032     908    2688    2434    2308    2308
   256    4096    3076    2568    2316    6144    5634    5380    5380
   512    9216    7172    6152    5644   13824   12802   12292   12292
  1024   20480   16388   14344   13324   30720   28674   27652   27652
  2048   45056   36868   32776   30732   67584   63490   61444   61444
  4096   98304   81924   73736   69644  147456  139266  135172  135172

Radix-4
     4      12       0       0       0      22      16      16      16
    16      96      36      28      24     176     146     144     144
    64     576     324     284     264    1056     930     920     920
   256    3072    2052    1884    1800    5632    5122    5080    5080
  1024   15360   11268   10588   10248   28160   26114   25944   25944
  4096   73728   57348   54620   53256  135168  126978  126296  126296

Radix-8
     8      32       4       4       4      66      52      52      52
    64     512     260     252     248    1056     930     928     928
   512    6144    4100    4028    3992   12672   11650   11632   11632
  4096   65536   49156   48572   48280  135168  126978  126832  126832

Radix-16
    16      80      20      20      20     178     148     148     148
   256    2560    1540    1532    1528    5696    5186    5184    5184
  4096   61440   45060   44924   44856  136704  128514  128480  128480

Split-Radix
     2       0       0       0       0       4       4       4       4
     4       8       0       0       0      20      16      16      16
     8      24       8       4       4      60      52      52      52
    16      72      32      28      24     164     144     144     144
    32     184     104      92      84     412     372     372     372
    64     456     288     268     248     996     912     912     912
   128    1080     744     700     660    2332    2164    2164    2164
   256    2504    1824    1740    1656    5348    5008    5008    5008
   512    5688    4328    4156    3988   12060   11380   11380   11380
  1024   12744   10016    9676    9336   26852   25488   25488   25488
  2048   28216   22760   22076   21396   59164   56436   56436   56436
  4096   61896   50976   49612   48248  129252  123792  123792  123792


Table 9.1: Number of Real Multiplications and Additions for 
Complex Single Radix FFTs 

In Table 9.1 Mi and Ai refer to the number of real multiplications
and real additions used by an FFT with i separately written but-
terflies. The first block has the counts for Radix-2, the second for
Radix-4, the third for Radix-8, the fourth for Radix-16, and the
last for the Split-Radix FFT. For the split-radix FFT, M3 and A3
refer to the two-butterfly-plus program and M5 and A5 refer to the
three-butterfly program.
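
The one-butterfly radix-2 column of Table 9.1 can be reproduced with a simple count. The Python sketch below assumes, as in the radix-2 program of this chapter, that every butterfly performs one complex add, one complex subtract, and one four-multiply-two-add complex multiplication, with no special cases removed. The function name is hypothetical.

```python
def radix2_one_butterfly_counts(n_pts):
    """Real multiplications and additions (M1, A1) for a one-butterfly
    radix-2 FFT of length n_pts = 2**m."""
    stages = n_pts.bit_length() - 1          # m = log2(N)
    butterflies = (n_pts // 2) * stages      # N/2 butterflies per stage
    mults = 4 * butterflies                  # 4 real mults per complex mult
    adds = 6 * butterflies                   # 4 for add/subtract, 2 in the mult
    return mults, adds
```

This gives $M1 = 2N \log_2 N$ and $A1 = 3N \log_2 N$, for example 98304 and 147456 at N = 4096. The other columns remove multiplications by trivial twiddle factors and do not follow from this simple count.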

The first evaluations of FFT algorithms were in terms of the num- 
ber of real multiplications required as that was the slowest op- 
eration on the computer and, therefore, controlled the execution 
speed. Later with hardware arithmetic both the number of multi- 
plications and additions became important. Modern systems have 
arithmetic speeds such that indexing and data transfer times be- 
come important factors. Morris [249] has looked at some of these 
problems and has developed a procedure called autogen to write 
partially straight-line program code to significantly reduce over- 
head and speed up FFT run times. Some hardware, such as the 
TMS320 signal processing chip, has the multiply and add opera- 
tions combined. Some machines have vector instructions or have 
parallel processors. Because the execution speed of an FFT de- 
pends not only on the algorithm, but also on the hardware archi- 
tecture and compiler, experiments must be run on the system to be 
used. 

In many cases the unscrambler or bit-reverse-counter requires 10%
of the execution time; therefore, if possible, it should be elimi-
nated. In high-speed convolution where the convolution is done
by multiplication of DFT's, a decimation-in-frequency FFT can be 
combined with a decimation-in-time inverse FFT to require no un- 
scrambler. It is also possible for a radix-2 FFT to do the unscram- 
bling inside the FFT but the structure is not very regular [299], 
[193]. Special structures can be found in [299] and programs for 
data that are real or have special symmetries are in [82], [102], 
[357]. 
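
The idea of pairing a decimation-in-frequency forward transform with a decimation-in-time inverse can be sketched as follows. This Python illustration assumes power-of-two lengths and uses hypothetical function names; the forward transform leaves its output in bit-reversed order, the pointwise product is formed in that same order, and the inverse accepts bit-reversed input, so no unscrambler appears anywhere.

```python
import cmath

def dif_fft(a):
    """Radix-2 DIF FFT: natural-order input, bit-reversed output."""
    a = list(a)
    n = len(a)
    span = n // 2
    while span >= 1:
        for start in range(0, n, 2 * span):
            for j in range(span):
                w = cmath.exp(-2j * cmath.pi * j / (2 * span))
                u, v = a[start + j], a[start + j + span]
                a[start + j] = u + v
                a[start + j + span] = (u - v) * w
        span //= 2
    return a

def dit_ifft(a):
    """Radix-2 DIT inverse FFT: bit-reversed input, natural output."""
    a = list(a)
    n = len(a)
    span = 1
    while span < n:
        for start in range(0, n, 2 * span):
            for j in range(span):
                w = cmath.exp(2j * cmath.pi * j / (2 * span))
                u, v = a[start + j], a[start + j + span] * w
                a[start + j] = u + v
                a[start + j + span] = u - v
        span *= 2
    return [v / n for v in a]

def cyclic_convolve(x, h):
    """Cyclic convolution by multiplying DFTs, with no unscrambling."""
    xf, hf = dif_fft(x), dif_fft(h)
    return dit_ifft([p * q for p, q in zip(xf, hf)])
```

Each inverse stage undoes the forward stage with the same span, which is why the bit-reversed ordering never has to be corrected explicitly.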

Although there can be significant differences in the efficiencies 
of the various Cooley-Tukey and Split-Radix FFTs, the number 
of multiplications and additions for all of them is on the order of 
NlogN. That is fundamental to the class of algorithms. 

9.4 The Quick Fourier Transform, An FFT 
based on Symmetries 

The development of fast algorithms usually consists of using spe- 
cial properties of the algorithm of interest to remove redundant or 
unnecessary operations of a direct implementation. The discrete 
Fourier transform (DFT) defined by 

$$C(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk} \tag{9.16}$$

where

$$W_N = e^{-j2\pi/N} \tag{9.17}$$

has enormous capacity for improvement of its arithmetic effi-
ciency. Most fast algorithms use the periodic and symmetric prop-
erties of its basis functions. The classical Cooley-Tukey FFT and
prime factor FFT [64] exploit the periodic properties of the co-
sine and sine functions. Their use of the periodicities to share and,
therefore, reduce arithmetic operations depends on the factorabil-
ity of the length of the data to be transformed. For highly com-
posite lengths, the number of floating-point operations is of order
$N \log(N)$ and for prime lengths it is of order $N^2$.

This section will look at an approach using the symmetric proper- 
ties to remove redundancies. This possibility has long been rec- 
ognized [176], [211], [344], [270] but has not been developed in 
any systematic way in the open literature. We will develop an al- 
gorithm, called the quick Fourier transform (QFT) [211], that will 
reduce the number of floating point operations necessary to com- 
pute the DFT by a factor of two to four over direct methods or 
Goertzel's method for prime lengths. Indeed, it seems the best 
general algorithm available for prime length DFTs. One can al- 
ways do better by using Winograd type algorithms but they must 
be individually designed for each length. The Chirp Z-transform 
can be used for longer lengths. 

9.4.1 Input and Output Symmetries 

We use the fact that the cosine is an even function and the sine is 
an odd function. The kernel of the DFT or the basis functions of 
the expansion is given by 

$$W_N^{nk} = e^{-j2\pi nk/N} = \cos(2\pi nk/N) - j \sin(2\pi nk/N) \tag{9.18}$$

which has an even real part and odd imaginary part. If the data
$x(n)$ are decomposed into their real and imaginary parts and those
into their even and odd parts, we have

$$x(n) = u(n) + j v(n) = [u_e(n) + u_o(n)] + j [v_e(n) + v_o(n)] \tag{9.19}$$

where the even part of the real part of $x(n)$ is given by

$$u_e(n) = (u(n) + u(-n))/2 \tag{9.20}$$

and the odd part of the real part is

$$u_o(n) = (u(n) - u(-n))/2 \tag{9.21}$$

with corresponding definitions of $v_e(n)$ and $v_o(n)$. Using (9.19)
with a simpler notation, the DFT of (9.16) becomes

$$C(k) = \sum_{n=0}^{N-1} (u + jv)(\cos - j \sin). \tag{9.22}$$

The sum over an integral number of periods of an odd function is
zero and the sum of an even function over half of the period is one
half the sum over the whole period. This causes (9.16) and (9.22)
to become

$$C(k) = \sum_{n=0}^{N/2-1} [u_e \cos + v_o \sin] + j [v_e \cos - u_o \sin] \tag{9.23}$$

for $k = 0, 1, 2, \cdots, N-1$.

The evaluation of the DFT using equation (9.23) requires half as
many real multiplications and half as many real additions as evalu-
ating it using (9.16) or (9.22). We have exploited the symmetries
of the sine and cosine as functions of the time index n. This is
independent of whether the length is composite or not. Another
view of this formulation is that we have used the distributivity
of multiplication over addition. In other words, rather than
multiplying two data points by the same value of a sine or cosine and
then adding the results, one should add the data points first and then
multiply the sum by the sine or cosine, which requires one rather
than two multiplications.
Next we take advantage of the symmetries of the sine and cosine
as functions of the frequency index k. Using these symmetries on
(9.23) gives

$$C(k) = \sum_{n=0}^{N/2-1} [u_e \cos + v_o \sin] + j [v_e \cos - u_o \sin] \tag{9.24}$$

$$C(N-k) = \sum_{n=0}^{N/2-1} [u_e \cos - v_o \sin] + j [v_e \cos + u_o \sin] \tag{9.25}$$

for $k = 0, 1, 2, \cdots, N/2 - 1$. This again reduces the number of
operations by a factor of two, this time because it calculates two
output values at a time. The first reduction by a factor of two is
always available. The second is possible only if both DFT values
are needed. It is not available if you are calculating only one DFT
value. The above development has not dealt with the details that
arise with the difference between an even and an odd length. That
is straightforward.
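
A small Python sketch of the two symmetry reductions is given below. It is not the QFT program from the literature; the function name `qft_pairs` is hypothetical. The even and odd parts are taken with indices modulo N, the half-length sums are doubled for the paired terms, and the n = 0 and (for even N) n = N/2 terms are handled separately. These details are assumptions of this sketch, filling in the even/odd length bookkeeping that the text leaves to the reader.

```python
import math

def qft_pairs(x):
    """DFT using even/odd input symmetry and the C(k), C(N-k) pairing."""
    n_pts = len(x)
    # even and odd parts of the complex data, indices taken modulo N
    xe = [(x[n] + x[-n % n_pts]) / 2 for n in range(n_pts)]
    xo = [(x[n] - x[-n % n_pts]) / 2 for n in range(n_pts)]
    half = (n_pts + 1) // 2          # n = 1 .. half-1 are paired with N-n
    out = [0j] * n_pts
    for k in range(n_pts // 2 + 1):
        a = xe[0]                    # cosine sum (even part only)
        b = 0j                       # sine sum (odd part only)
        for n in range(1, half):
            ang = 2 * math.pi * n * k / n_pts
            a += 2 * xe[n] * math.cos(ang)
            b += 2 * xo[n] * math.sin(ang)
        if n_pts % 2 == 0:           # unpaired middle term for even N
            a += xe[n_pts // 2] * math.cos(math.pi * k)
        out[k] = a - 1j * b                  # C(k)
        out[(n_pts - k) % n_pts] = a + 1j * b  # C(N-k) from the same sums
    return out
```

Each pair of outputs shares one cosine sum and one sine sum, and each sum runs over only about half the data, which illustrates the two factor-of-two savings. The sketch works for odd lengths such as N = 7 as well as even ones.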



9.4.2 Further Reductions if the Length is Even 

If the length of the sequence to be transformed is even, there are
further symmetries that can be exploited. There will be four data
values that are all multiplied by plus or minus the same sine or
cosine value. This means a more complicated pre-addition pro-
cess, which is a generalization of the simple calculation of the even
and odd parts in (9.20) and (9.21), will reduce the size of the order
$N^2$ part of the algorithm by still another factor of two or four. If
the length is divisible by 4, the process can be repeated. Indeed,
if the length is a power of 2, one can show this process is equiv-
alent to calculating the DFT in terms of discrete cosine and sine
transforms [156], [159] with a resulting arithmetic complexity of
order $N \log(N)$ and with a structure that is well suited to real data
calculations and pruning.

If the flow-graph of the Cooley-Tukey FFT is compared to the
flow-graph of the QFT, one notices both similarities and
differences. Both progress in stages as the length is continually
divided by two. The Cooley-Tukey algorithm uses the periodic
properties of the sine and cosine to give the familiar horizontal
tree of butterflies. The parallel diagonal lines in this graph
represent the parallel stepping through the data in synchronism
with the periodic basis functions. The QFT has diagonal lines that
connect the first data point with the last, then the second with
the next to last, and so on to give a "star" like picture. This is
interesting in that one can look at the flow graph of an algorithm
developed by some completely different strategy and often find
sections with the parallel structures and other parts with the star
structure. These must be using some underlying periodic and
symmetric properties of the basis functions.

9.4.3 Arithmetic Complexity and Timings 

A careful analysis of the QFT shows that 2N additions are
necessary to compute the even and odd parts of the input data. This
is followed by the length-N/2 inner product that requires
$4(N/2)^2 = N^2$ real multiplications and an equal number of
additions. This is followed by the calculations necessary for the
simultaneous calculation of the first half and last half of C(k),
which requires $4(N/2) = 2N$ real additions. This means the total QFT
algorithm requires $N^2$ real multiplications and $N^2 + 4N$ real
additions. These numbers along with those for the Goertzel
algorithm [52], [64], [270] and the direct calculation of the DFT
are included in the following table. Of the various order-$N^2$ DFT
algorithms, the QFT seems to be the most efficient general method
for an arbitrary length N.

Algorithm                  Real Mults.   Real Adds    Trig Eval.
Direct DFT                 4N^2          4N^2         2N^2
Mod. 2nd-Order Goertzel    N^2 + N       2N^2 + N     N
QFT                        N^2           N^2 + 4N     2N

Table 9.2

Timings of the algorithms on a PC in milliseconds are given in the 
following table. 



Algorithm                  N = 125    N = 256
Direct DFT                 4.90       19.83
Mod. 2nd-Order Goertzel    1.32       5.55
QFT                        1.09       4.50
Chirp + FFT                1.70       3.52

Table 9.3

These timings track the floating point operation counts fairly well. 
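
As a rough check of that statement, the short Python sketch below evaluates the formulas of Table 9.2 for the two lengths of Table 9.3 and prints operation counts and timings relative to the QFT. The helper names are hypothetical, and treating a multiplication and an addition as equally costly is an assumption made only for this comparison. The chirp method is left out because Table 9.2 gives no count for it.

```python
def op_counts(N):
    """Real multiplications plus real additions, from the formulas of Table 9.2."""
    return {
        "Direct DFT": 4 * N**2 + 4 * N**2,
        "Mod. 2nd-Order Goertzel": (N**2 + N) + (2 * N**2 + N),
        "QFT": N**2 + (N**2 + 4 * N),
    }


# Timings in milliseconds copied from Table 9.3.
TIMINGS_MS = {
    125: {"Direct DFT": 4.90, "Mod. 2nd-Order Goertzel": 1.32, "QFT": 1.09},
    256: {"Direct DFT": 19.83, "Mod. 2nd-Order Goertzel": 5.55, "QFT": 4.50},
}


def relative_to_qft(values):
    """Scale every entry by the QFT entry."""
    return {name: v / values["QFT"] for name, v in values.items()}


for N, timings in TIMINGS_MS.items():
    ops = relative_to_qft(op_counts(N))
    times = relative_to_qft(timings)
    for name in ops:
        print(f"N={N:4d} {name:24s} ops x{ops[name]:.2f}  time x{times[name]:.2f}")
```

For both lengths the direct DFT needs about four times the arithmetic of the QFT and takes about four and a half times as long, which is the kind of agreement the sentence above describes.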



9.4.4 Conclusions 

The QFT is a straightforward DFT algorithm that uses all of the
possible symmetries of the DFT basis function with no requirements
on the length being composite. These ideas have been proposed
before, but have not been published or clearly developed by
[211], [344], [342], [168]. It seems that the basic QFT is practical
and useful as a general algorithm for lengths up to a hundred or so.
Above that, the chirp z-transform [64] or other filter based
methods will be superior. For special cases and shorter lengths,
methods based on Winograd's theories will always be superior.
Nevertheless, the QFT has a definite place in the array of DFT
algorithms and is not well known. A Fortran program is included in
the appendix.

It is possible, but unlikely, that further arithmetic reduction could
be achieved using the fact that $W_N$ has unity magnitude as was
done in the second-order Goertzel algorithm. It is also possible
that some way of combining the Goertzel and QFT algorithms would
have some advantages. A development of a complete QFT
decomposition of a DFT of length $2^M$ shows interesting structure
[156], [159] and arithmetic complexity comparable to average
Cooley-Tukey FFTs. It does seem better suited to real data
calculations with pruning.
Chapter 10

The Prime Factor and 
Winograd Fourier Transform 
Algorithms 1 



The prime factor algorithm (PFA) and the Winograd Fourier trans- 
form algorithm (WFTA) are methods for efficiently calculating the 
DFT which use, and in fact, depend on the Type-1 index map from 
Multidimensional Index Mapping: Equation 10 (3.10) and Mul- 
tidimensional Index Mapping: Equation 6 (3.6). The use of this 
index map preceded Cooley and Tukey's paper [150], [302] but its 
full potential was not realized until it was combined with Wino- 
grad's short DFT algorithms. The modern PFA was first presented 
in [213] and a program given in [57]. The WFTA was first pre- 
sented in [407] and programs given in [236], [83]. 

The number theoretic basis for the indexing in these algorithms
may, at first, seem more complicated than in the Cooley-Tukey
FFT; however, if approached from the general index mapping point
of view of Multidimensional Index Mapping (Chapter 3), it is
straightforward, and part of a common approach to breaking large
problems into smaller ones. The development in this section will
parallel that in The Cooley-Tukey Fast Fourier Transform
Algorithm (Chapter 9).

1 This content is available online at <http://cnx.org/content/m16335/1.9/>.

The general index maps of Multidimensional Index Mapping:
Equation 6 (3.6) and Multidimensional Index Mapping: Equation 12
(3.12) must satisfy the Type-1 conditions of Multidimensional
Index Mapping: Equation 7 (3.7) and Multidimensional Index
Mapping: Equation 10 (3.10), which are

$$K_1 = aN_2 \text{ and } K_2 = bN_1 \text{ with } (K_1, N_1) = (K_2, N_2) = 1 \tag{10.1}$$

$$K_3 = cN_2 \text{ and } K_4 = dN_1 \text{ with } (K_3, N_1) = (K_4, N_2) = 1 \tag{10.2}$$

The row and column calculations in Multidimensional Index
Mapping: Equation 15 (3.15) are uncoupled by Multidimensional
Index Mapping: Equation 16 (3.16), which for this case are

$$((K_1 K_4))_N = ((K_2 K_3))_N = 0 \tag{10.3}$$

In addition, to make each short sum a DFT, the $K_i$ must also satisfy

$$((K_1 K_3))_N = N_2 \text{ and } ((K_2 K_4))_N = N_1 \tag{10.4}$$

In order to have the smallest values for $K_i$, the constants in (10.1)
are chosen to be

$$a = b = 1, \quad c = ((N_2^{-1}))_{N_1}, \quad d = ((N_1^{-1}))_{N_2} \tag{10.5}$$

which gives for the index maps in (10.1)

$$n = ((N_2 n_1 + N_1 n_2))_N \tag{10.6}$$

$$k = ((K_3 k_1 + K_4 k_2))_N \tag{10.7}$$

The frequency index map is a form of the Chinese remainder
theorem. Using these index maps, the DFT in Multidimensional Index
Mapping: Equation 15 (3.15) becomes

$$X = \sum_{n_2=0}^{N_2-1} \sum_{n_1=0}^{N_1-1} x \, W_{N_1}^{n_1 k_1} W_{N_2}^{n_2 k_2} \tag{10.8}$$

which is a pure two-dimensional DFT with no twiddle factors, and
the summations can be done in either order. Choices other than
(10.5) could be used. For example, a = b = c = d = 1 will cause
the input and output index map to be the same and, therefore, there
will be no scrambling of the output order. The short summations
in (10.8), however, will no longer be short DFT's [57].
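
The conditions (10.1) to (10.7) can be checked numerically for the length-15 case. The Python sketch below is only an illustration under stated assumptions: the names `pfa_2d` and `direct_dft` are hypothetical, the multiplicative inverses in (10.5) are taken modulo N1 and N2 using the three-argument `pow` (Python 3.8 or later), and the short sums are evaluated directly rather than by Winograd modules.

```python
import cmath
import math

N1, N2 = 3, 5
N = N1 * N2

# Type-1 map constants with a = b = 1 and c, d chosen as in (10.5).
K1, K2 = N2, N1
c = pow(N2, -1, N1)          # inverse of N2 modulo N1
d = pow(N1, -1, N2)          # inverse of N1 modulo N2
K3, K4 = c * N2, d * N1


def W(M, e):
    """The DFT kernel W_M^e = exp(-j 2 pi e / M)."""
    return cmath.exp(-2j * math.pi * e / M)


def direct_dft(x):
    return [sum(x[n] * W(N, n * k) for n in range(N)) for k in range(N)]


def pfa_2d(x):
    """Evaluate (10.8): a two-dimensional DFT with no twiddle factors."""
    X = [0j] * N
    for k1 in range(N1):
        for k2 in range(N2):
            acc = 0j
            for n2 in range(N2):
                for n1 in range(N1):
                    n = (K1 * n1 + K2 * n2) % N      # input map (10.6)
                    acc += x[n] * W(N1, n1 * k1) * W(N2, n2 * k2)
            X[(K3 * k1 + K4 * k2) % N] = acc          # output map (10.7)
    return X


# Uncoupling condition (10.3) and short-DFT condition (10.4).
print((K1 * K4) % N, (K2 * K3) % N)   # both zero
print((K1 * K3) % N, (K2 * K4) % N)   # N2 and N1
```

For N1 = 3 and N2 = 5 this gives K3 = 10 and K4 = 6, and `pfa_2d` agrees with `direct_dft` to rounding error even though no twiddle factor appears between the two short sums.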

An important feature of the short Winograd DFT's described in
Winograd's Short DFT Algorithms (Chapter 7) that is useful for
both the PFA and WFTA is the fact that the multiplier constants
in Winograd's Short DFT Algorithms: Equation 6 (7.6) or
Winograd's Short DFT Algorithms: Equation 8 (7.8) are either real
or imaginary, never a general complex number. For that reason,
multiplication by complex data requires only two real
multiplications, not four. That is a very significant feature. It
is also true that the j multiplier can be commuted from the D
operator to the last part of the $A^T$ operator. This means the D
operator has only real multipliers and the calculations on real
data remain real until the last stage. This can be seen by
examining the short DFT modules in [65], [198] and in the
appendices.

10.1 The Prime Factor Algorithm 

If the DFT is calculated directly using (10.8), the algorithm is
called a prime factor algorithm [150], [302] and was discussed
in Winograd's Short DFT Algorithms (Chapter 7) and
Multidimensional Index Mapping: In-Place Calculation of the DFT and
Scrambling (Section 3.2: In-Place Calculation of the DFT and
Scrambling). When the short DFTs are calculated by the very
efficient algorithms of Winograd discussed in Factoring the Signal
Processing Operators (Chapter 6), the PFA becomes a very powerful
method that is as fast or faster than the best Cooley-Tukey
FFT's [57], [213].

A flow graph is not as helpful with the PFA as it was with the
Cooley-Tukey FFT; however, the representation in Figure 10.1,
which combines Multidimensional Index Mapping: Figure 1
(Figure 3.1) and Winograd's Short DFT Algorithms: Figure 2
(Figure 7.2), gives a good picture of the algorithm with the
example of Multidimensional Index Mapping: Equation 25 (3.25).

Figure 10.1: A Prime Factor FFT for N = 15



If N is factored into three factors, the DFT of (10.8) would have
three nested summations and would be a three-dimensional DFT.
This principle extends to any number of factors; however, recall
that the Type-1 map requires that all the factors be relatively prime.
A very simple three-loop indexing scheme has been developed [57]
which gives a compact, efficient PFA program for any number of
factors. The basic program structure is illustrated in Listing 10.1,
with the short DFT's being omitted for clarity. Complete programs
are given in [65] and in the appendices.

C     PFA INDEXING LOOPS
      DO 10 K = 1, M
         N1 = NI(K)
         N2 = N/N1
         I(1) = 1
         DO 20 J = 1, N2
            DO 30 L = 2, N1
               I(L) = I(L-1) + N2
               IF (I(L) .GT. N) I(L) = I(L) - N
   30       CONTINUE
            GOTO (20,102,103,104,105), N1
            I(1) = I(1) + N1
   20    CONTINUE
   10 CONTINUE
      RETURN
C     MODULE FOR N=2
  102 R1 = X(I(1))
      X(I(1)) = R1 + X(I(2))
      X(I(2)) = R1 - X(I(2))
      R1 = Y(I(1))
      Y(I(1)) = R1 + Y(I(2))
      Y(I(2)) = R1 - Y(I(2))
      GOTO 20
C     OTHER MODULES
  103 Length-3 DFT
  104 Length-4 DFT
  105 Length-5 DFT
      etc.

Listing 10.1: Part of a FORTRAN PFA Program

As in the Cooley-Tukey program, the DO 10 loop steps through
the M stages (factors of N) and the DO 20 loop calculates the N/N1
length-N1 DFT's. The input index map of (10.6) is implemented in
the DO 30 loop and the statement just before label 20. In the PFA,
each stage or factor requires a separately programmed module or
butterfly. This lengthens the PFA program but an efficient
Cooley-Tukey program will also require three or more butterflies.

Because the PFA is calculated in-place using the input index map,
the output is scrambled. There are five approaches to dealing with
this scrambled output. First, there are some applications where the
output does not have to be unscrambled, as in the case of high-speed
convolution. Second, an unscrambler can be added after the PFA
to give the output in correct order just as the bit-reversed-counter
is used for the Cooley-Tukey FFT. A simple unscrambler is given
in [65], [57] but it is not in place. The third method does the
unscrambling in the modules while they are being calculated. This
is probably the fastest method but the program must be written for
a specific length [65], [57]. A fourth method is similar and
achieves the unscrambling by choosing the multiplier constants in
the modules properly [198]. The fifth method uses a separate
indexing method for the input and output of each module [65], [320].
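
The three-loop indexing and the second approach (a separate, out-of-place unscrambler) can be sketched in Python as follows. This is an illustrative sketch, not a translation of the programs in [65] or [57]: the function names are hypothetical, indices are zero-based, and each short DFT is computed directly instead of by a Winograd module. The unscrambler uses the fact that, with the maps (10.6) and (10.7), the value C(k) ends up at the location whose coordinate along each factor N_i is k mod N_i.

```python
import cmath
import math


def short_dft(v):
    """Direct length-len(v) DFT standing in for a Winograd module."""
    n = len(v)
    return [sum(v[m] * cmath.exp(-2j * math.pi * m * k / n) for m in range(n))
            for k in range(n)]


def pfa_inplace(x, factors):
    """In-place PFA over relatively prime factors; output is left scrambled."""
    N = len(x)
    assert math.prod(factors) == N
    assert all(math.gcd(p, q) == 1
               for i, p in enumerate(factors) for q in factors[i + 1:])
    for N1 in factors:                     # one stage per factor (DO 10)
        N2 = N // N1
        start = 0
        for _ in range(N2):                # N/N1 short DFTs (DO 20)
            idx = [(start + l * N2) % N for l in range(N1)]   # DO 30
            out = short_dft([x[i] for i in idx])
            for i, val in zip(idx, out):
                x[i] = val
            start += N1
    return x


def unscramble(y, factors):
    """Out-of-place reordering of the scrambled PFA output."""
    N = len(y)
    return [y[sum((N // Ni) * (k % Ni) for Ni in factors) % N]
            for k in range(N)]
```

For example, `unscramble(pfa_inplace(list(x), [3, 4, 5]), [3, 4, 5])` reproduces a direct length-60 DFT of `x` to rounding error, while the in-place result before unscrambling holds the same values in permuted order.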

10.2 The Winograd Fourier Transform Al- 
gorithm 

The Winograd Fourier transform algorithm (WFTA) uses a very
powerful property of the Type-1 index map and the DFT to give
a further reduction of the number of multiplications in the PFA.
Using an operator notation where $F_1$ represents taking row DFT's
and $F_2$ represents column DFT's, the two-factor PFA of (10.8) is
represented by

$$X = F_2 F_1 x \tag{10.9}$$

It has been shown [410], [190] that if each operator represents
identical operations on each row or column, they commute. Since
$F_1$ and $F_2$ represent length $N_1$ and $N_2$ DFT's, they commute and
(10.9) can also be written

$$X = F_1 F_2 x \tag{10.10}$$

If each short DFT in F is expressed by three operators as in
Winograd's Short DFT Algorithms: Equation 8 (7.8) and Winograd's
Short DFT Algorithms: Figure 2 (Figure 7.2), F can be factored as

$$F = A^T D A \tag{10.11}$$

where A represents the set of additions done on each row or
column that performs the residue reduction as in Winograd's Short
DFT Algorithms: Equation 30 (7.30). Because of the appearance of
the flow graph of A and because it is the first operator on x, it is
called a preweave operator [236]. D is the set of M multiplications
and $A^T$ (or $B^T$ or $C^T$) from Winograd's Short DFT Algorithms:
Equation 5 (7.5) or Winograd's Short DFT Algorithms: Equation 6
(7.6) is the reconstruction operator called the postweave. Applying
(10.11) to (10.9) gives

$$X = A_2^T D_2 A_2 A_1^T D_1 A_1 x \tag{10.12}$$

This is the PFA of (10.8) and Figure 10.1, where $A_1^T D_1 A_1$
represents the row DFT's on the array formed from x. Because these
operators commute, (10.12) can also be written as

$$X = A_2^T A_1^T D_2 D_1 A_2 A_1 x \tag{10.13}$$

or

$$X = A_1^T A_2^T D_2 D_1 A_2 A_1 x \tag{10.14}$$

but the two adjacent multiplication operators can be premultiplied
and the result represented by one operator $D = D_2 D_1$, which is no
longer the same for each row or column. Equation (10.14) becomes

$$X = A_1^T A_2^T D A_2 A_1 x \tag{10.15}$$

This is the basic idea of the Winograd Fourier transform algorithm.
The commuting of the multiplication operators together in the
center of the algorithm is called nesting and it results in a
significant decrease in the number of multiplications that must be
done at the execution of the algorithm. Pictorially, the PFA of
Figure 10.1 becomes [213] the WFTA in Figure 10.2.
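
The nesting in (10.15) can be tried on a small case. The Python sketch below uses a length-12 transform with factors 3 and 4 rather than the length-15 example, because these two modules are short enough to write out and neither expands the data. The matrices `A3`, `D3`, `C3`, `A4`, `D4`, `C4` are an assumed factorization of the length-3 and length-4 DFTs into preweave additions, one multiplication stage, and postweave additions; they are written here for illustration and are not copied from the modules of Chapter 7.

```python
import cmath
import math

U = 2 * math.pi / 3
# Assumed factorizations F = C D A (C plays the role of the postweave).
A3 = [[1, 1, 1], [0, 1, 1], [0, 1, -1]]
D3 = [1, math.cos(U) - 1, -1j * math.sin(U)]
C3 = [[1, 0, 0], [1, 1, 1], [1, 1, -1]]
A4 = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 0, -1, 0], [0, 1, 0, -1]]
D4 = [1, 1, 1, -1j]
C4 = [[1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 0, 0], [0, 0, 1, -1]]


def along_rows(M, g):
    """Apply matrix M to every row of the 2-D array g."""
    return [[sum(m * v for m, v in zip(r, row)) for r in M] for row in g]


def along_cols(M, g):
    """Apply matrix M to every column of g (transpose, apply, transpose back)."""
    t = along_rows(M, [list(col) for col in zip(*g)])
    return [list(row) for row in zip(*t)]


# Nested multiplier array: every product of a D3 entry with a D4 entry.
D = [[d3 * d4 for d4 in D4] for d3 in D3]


def wfta12(x):
    N1, N2, N = 3, 4, 12
    g = [[x[(N2 * n1 + N1 * n2) % N] for n2 in range(N2)] for n1 in range(N1)]
    g = along_cols(A3, g)                      # all preweave additions first
    g = along_rows(A4, g)
    g = [[D[i][j] * g[i][j] for j in range(N2)] for i in range(N1)]  # one multiply stage
    g = along_rows(C4, g)                      # then all postweave additions
    g = along_cols(C3, g)
    return [g[k % N1][k % N2] for k in range(N)]


nontrivial = sum(1 for row in D for v in row
                 if complex(v) not in (1, -1, 1j, -1j))
print(len(D) * len(D[0]), nontrivial)
```

Every entry of `D` is purely real or purely imaginary, and only 8 of the 12 entries differ from plus or minus 1 or j. At two real multiplications each, that matches the 24 and 16 listed for N = 12 under the WFTA in Table 10.1.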




Figure 10.2: A Length-15 WFTA with Nested Multiplications



The rectangular structure of the preweave addition operators 
causes an expansion of the data in the center of the algorithm. The 
15 data points in Figure 10.2 become 18 intermediate values. This 
expansion is a major problem in programming the WFTA because 
it prevents a straightforward in-place calculation and causes an in- 
crease in the number of required additions and in the number of 
multiplier constants that must be precalculated and stored. 

From Figure 10.2 and the idea of premultiplying the individual
multiplication operators, it can be seen why the multiplications by
unity had to be considered in Winograd's Short DFT Algorithms:
Table 1 (Table 7.1). Even if a multiplier in $D_1$ is unity, it may
not be in $D_2 D_1$. In Figure 10.2 with factors of three and five,
there appear to be 18 multiplications required because of the
expansion of the length-5 preweave operator, $A_2$; however, one of
the multipliers in each of the length three and five operators is
unity, so one of the 18 multipliers in the product is unity. This
gives 17 required multiplications, a rather impressive reduction
from the $15^2 = 225$ multiplications required by direct
calculation. This number of 17 complex multiplications will require
only 34 real multiplications because, as mentioned earlier, the
multiplier constants are purely real or imaginary, while the 225
complex multiplications are general and therefore will require four
times as many real multiplications.
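
The count can be written out step by step. The split of the 18 multiplier positions into 3 for the length-3 module and 6 for the length-5 module is an assumption, inferred from the expansion of 15 data points to 18 intermediate values described for Figure 10.2.

```latex
\begin{align*}
\text{multiplier positions in } D_2 D_1 &: \quad 3 \times 6 = 18 \\
\text{positions left after removing the one unity product} &: \quad 18 - 1 = 17 \\
\text{real multiplications (constants purely real or imaginary)} &: \quad 2 \times 17 = 34 \\
\text{real multiplications for direct calculation} &: \quad 4 \times 15^2 = 4 \times 225 = 900
\end{align*}
```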

The number of additions depends on the order of the pre- and
postweave operators. For example, in the length-15 WFTA in
Figure 10.2, if the length-5 had been done first and last, there
would have been six row addition preweaves in the preweave operator
rather than the five shown. It is difficult to illustrate the
algorithm for three or more factors of N, but the ideas apply to
any number of factors. Each length has an optimal ordering of the
pre- and postweave operators that will minimize the number of
additions.

A program for the WFTA is not as simple as for the FFT or PFA 
because of the very characteristic that reduces the number of mul- 
tiplications, the nesting. A simple two-factor example program is 
given in [65] and a general program can be found in [236], [83]. 
The same lengths are possible with the PFA and WFTA and the 
same short DFT modules can be used, however, the multiplies in 
the modules must occur in one place for use in the WFTA. 
10.3 Modifications of the PFA and WFTA
Type Algorithms 

In the previous section it was seen how using the permutation
property of the elementary operators in the PFA allowed the nesting
of the multiplications to reduce their number. It was also seen
that a proper ordering of the operators could minimize the number
of additions. These ideas have been extended in formulating a more
general algorithm optimizing problem. If the DFT operator F in
(10.11) is expressed in a still more factored form obtained from
Winograd's Short DFT Algorithms: Equation 30 (7.30), a greater
variety of ordering can be optimized. For example, if the A
operators have two factors

$$F_i = A_i^T A_i^{\prime T} D_i A_i^{\prime} A_i \tag{10.16}$$

The DFT in (10.10) becomes

$$X = A_2^T A_2^{\prime T} D_2 A_2^{\prime} A_2 A_1^T A_1^{\prime T} D_1 A_1^{\prime} A_1 x \tag{10.17}$$

The operator notation is very helpful in understanding the central
ideas, but may hide some important facts. It has been shown [410],
[198] that operators in different $F_i$ commute with each other, but
the order of the operators within an $F_i$ cannot be changed. They
represent the matrix multiplications in Winograd's Short DFT
Algorithms: Equation 30 (7.30) or Winograd's Short DFT
Algorithms: Equation 8 (7.8) which do not commute.

This formulation allows a very large set of possible orderings; in
fact, the number is so large that some automatic technique must
be used to find the "best". It is possible to set up a criterion of
optimality that not only includes the number of multiplications but
the number of additions as well. The effects of relative
multiply-add times, data transfer times, CPU register and memory
sizes, and other hardware characteristics can be included in the
criterion. Dynamic programming can then be applied to derive an
optimal algorithm for a particular application [190]. This is a
very interesting idea as there is no longer a single algorithm, but
a class and an optimizing procedure. The challenge is to generate a
broad enough class to result in a solution that is close to a
global optimum and to have a practical scheme for finding the
solution.

Results obtained applying the dynamic programming method to 
the design of fairly long DFT algorithms gave algorithms that 
had fewer multiplications and additions than either a pure PFA 
or WFTA [190]. It seems that some nesting is desirable but not 
total nesting for four or more factors. There are also some interest- 
ing possibilities in mixing the Cooley-Tukey with this formulation. 
Unfortunately, the twiddle factors are not the same for all rows and 
columns, therefore, operations cannot commute past a twiddle fac- 
tor operator. There are ways of breaking the total algorithm into 
horizontal paths and using different orderings along the different 
paths [264], [198]. In a sense, this is what the split-radix FFT does 
with its twiddle factors when compared to a conventional Cooley- 
Tukey FFT. 

There are other modifications of the basic structure of the Type-1 
index map DFT algorithm. One is to use the same index struc- 
ture and conversion of the short DFT's to convolution as the PFA 
but to use some other method for the high-speed convolution. Ta- 
ble look-up of partial products based on distributed arithmetic to 
eliminate all multiplications [78] looks promising for certain very 
specific applications, perhaps for specialized VLSI implementa- 
tion. Another possibility is to calculate the short convolutions us- 
ing number-theoretic transforms [30], [236], [264]. This would 
also require special hardware. Direct calculation of short
convolutions is faster on certain pipelined processors such as the
TMS-320 microprocessor [216].

10.4 Evaluation of the PFA and WFTA

As for the Cooley-Tukey FFT's, the first evaluation of these
algorithms will be on the number of multiplications and additions
required. The number of multiplications to compute the PFA in
(10.8) is given by Multidimensional Index Mapping: Equation 3
(3.3). Using the notation that T(N) is the number of
multiplications or additions necessary to calculate a length-N DFT,
the total number for a four-factor PFA of length N, where
$N = N_1 N_2 N_3 N_4$, is

$$T(N) = N_1 N_2 N_3 T(N_4) + N_2 N_3 N_4 T(N_1) + N_3 N_4 N_1 T(N_2) + N_4 N_1 N_2 T(N_3) \tag{10.18}$$
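
Equation (10.18) says that each module of length N_i is executed N/N_i times, which extends to any number of factors. The Python sketch below applies that rule. The per-module counts in `MODULES` (real multiplications, real additions for complex data) are assumed values for the lengths 2, 3, 4, 5 and 7; they are not taken from this section, but they do reproduce the PFA entries of Table 10.1 for N = 10, 12, 14, 20, 21 and 28.

```python
from math import prod

# Assumed (real multiplications, real additions) per short module.
MODULES = {2: (0, 4), 3: (4, 12), 4: (0, 16), 5: (10, 34), 7: (16, 72)}


def pfa_counts(factors):
    """Total (multiplications, additions) for a PFA with the given factors."""
    N = prod(factors)
    mults = sum((N // f) * MODULES[f][0] for f in factors)
    adds = sum((N // f) * MODULES[f][1] for f in factors)
    return mults, adds


for factors in ([2, 5], [3, 4], [2, 7], [4, 5], [3, 7], [4, 7]):
    print(prod(factors), pfa_counts(factors))
```

For the length-15 example, `pfa_counts([3, 5])` gives 50 multiplications, the PFA value listed for N = 15.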

The counts of multiplies and adds in Table 10.1 are calculated from
(10.18) with the counts of the factors taken from Winograd's Short
DFT Algorithms: Table 1 (Table 7.1). The list of lengths are those
possible with modules in the program of length 2, 3, 4, 5, 7, 8,
9 and 16 as is true for the PFA in [65], [57] and the WFTA in
[236], [83]. A maximum of four relatively prime lengths can be
used from this group, giving 59 different lengths over the range
from 2 to 5040. The radix-2 or split-radix FFT allows 12 different
lengths over the same range. If modules of length 11 and 13 from
[188] are added, the maximum length becomes 720720 and the
number of different lengths becomes 239. Adding modules for 17,
19 and 25 from [188] gives a maximum length of 1163962800 and
a very large and dense number of possible lengths. The length of
the code for the longer modules becomes excessive and should not
be included unless needed.
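
These counts can be reproduced by enumerating products of relatively prime modules. In the Python sketch below, the grouping of modules by prime (so that at most one module per prime is used) is the assumption that makes the factors relatively prime; the function name is hypothetical, and `math.prod` requires Python 3.8 or later.

```python
from itertools import product
from math import prod


def pfa_lengths(groups):
    """All lengths formed by taking at most one module from each prime group."""
    lengths = set()
    for choice in product(*[(1,) + g for g in groups]):
        n = prod(choice)
        if n > 1:
            lengths.add(n)
    return sorted(lengths)


base = [(2, 4, 8, 16), (3, 9), (5,), (7,)]
with_11_13 = base + [(11,), (13,)]
with_all = [(2, 4, 8, 16), (3, 9), (5, 25), (7,), (11,), (13,), (17,), (19,)]

print(len(pfa_lengths(base)), max(pfa_lengths(base)))              # 59 5040
print(len(pfa_lengths(with_11_13)), max(pfa_lengths(with_11_13)))  # 239 720720
print(max(pfa_lengths(with_all)))                                  # 1163962800

radix2 = [2**m for m in range(1, 13) if 2**m <= 5040]
print(len(radix2))                                                 # 12
```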

The number of multiplications necessary for the WFTA is simply
the product of those necessary for the required modules,
including multiplications by unity. The total number may contain
some unity multipliers but it is difficult to remove them in a
practical program. Table 10.1 contains both the total number
(MULTS) and the number with the unity multiplies removed (RMULTS).

Calculating the number of additions for the WFTA is more
complicated than for the PFA because of the expansion of the data
moving through the algorithm. For example, the number of additions,
TA, for the length-15 example in Figure 10.2 is given by

$$TA(N) = N_2 \, TA(N_1) + TM_1 \, TA(N_2) \tag{10.19}$$

where $N_1 = 3$, $N_2 = 5$, and $TM_1$ is the number of multiplies
for the length-3 module and hence the expansion factor. As mentioned
earlier there is an optimum ordering to minimize additions. The
ordering used to calculate Table 10.1 is the ordering used in [236],
[83] which is optimal in most cases and close to optimal in the
others.
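
The effect of the ordering can be seen by evaluating (10.19) both ways for the length-15 example. In the Python sketch below, the module data (12 and 34 real additions, 3 and 6 multiplier positions for lengths 3 and 5) are assumed values, chosen to be consistent with the expansion from 15 to 18 values described for Figure 10.2; the function name is hypothetical.

```python
# Assumed module data: (length, real additions, multiplier positions).
MOD3 = (3, 12, 3)
MOD5 = (5, 34, 6)


def wfta_adds(outer, inner):
    """Equation (10.19): the outer module is applied first and last."""
    n_outer, ta_outer, tm_outer = outer
    n_inner, ta_inner, tm_inner = inner
    # n_inner copies of the outer additions, then the inner additions
    # repeated once per (expanded) output of the outer preweave.
    return n_inner * ta_outer + tm_outer * ta_inner


print(wfta_adds(MOD3, MOD5))   # 5*12 + 3*34 = 162
print(wfta_adds(MOD5, MOD3))   # 3*34 + 6*12 = 174
```

The first ordering gives 162, the WFTA addition count listed for N = 15 in Table 10.1. Doing the length-5 module first and last costs 174, because the length-3 additions must then be carried out on six expanded rows instead of five.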

          PFA      PFA      WFTA     WFTA     WFTA
   N      Mults    Adds     Mults    RMults   Adds
   10     20       88       24       20       88
   12     16       96       24       16       96
   14     32       172      36       32       172
   15     50       --       36       34       162
   18     40       204      44       40       --
   20     40       216      48       40       216
   21     76       300      54       52       300
   24     44       252      48       36       252
   28     64       400      72       64       400
   30     --       --       72       68       384
   35     --       --       --       106      --
   36     --       --       88       80       --
   40     --       --       --       84       --
   56     --       940      144      --       --
   60     --       888      --       --       --
   63     284      --       198      --       --
   70     300      --       --       --       --
   72     196      1140     --       --       --
   80     260      --       216      --       --
   84     --       1536     216      --       --
   105    --       2214     --       --       --
   120    460      --       --       --       --
   126    --       --       --       392      --
   168    --       --       432      420      --
   210    --       --       648      644      --
   240    1100     --       648      632      5136
   252    1136     5952     792      784      --
   280    1340     --       864      852      7148
   315    2050     8322     1188     1186     10336
   336    1636     --       972      956      8508
   360    1700     8148     1056     1044     --
   420    --       10536    1296     1288     11352
   504    2524     13164    1584     1572     --
   560    3100     --       1944     --       --
   630    4100     --       2376     --       21932
   720    3940     18276    2376     2360     --
   840    --       --       2592     2580     24804
   1008   5804     29100    3564     3548     34416
   1260   8200     38328    4752     4744     --
   1680   11540    50964    5832     5816     59064
   2520   17660    82956    9504     9492     99068
   5040   39100    179772   21384    21368    232668



Table 10.1: Number of Real Multiplications and Additions for 
Complex PFA and WFTA FFTs 

From Table 10.1 we see that compared to the PFA or any of the
Cooley-Tukey FFT's, the WFTA has significantly fewer
multiplications. For the shorter lengths, the WFTA and the PFA have
approximately the same number of additions; however, for longer
lengths, the PFA has fewer and the Cooley-Tukey FFT's always
have the fewest. If the total arithmetic, the number of
multiplications plus the number of additions, is compared, the
split-radix FFT, PFA and WFTA all have about the same count.
Special versions of the PFA and WFTA have been developed for real
data [178], [358].

The size of the Cooley-Tukey program is the smallest, the PFA
next and the WFTA largest. The PFA requires the smallest number
of stored constants, the Cooley-Tukey or split-radix FFT next,
and the WFTA requires the largest number. For a DFT of length
approximately 1000, the PFA stores 28 constants, the FFT 2048 and
the WFTA 3564. Both the FFT and PFA can be calculated in-place and
the WFTA cannot. The PFA can be calculated in-order without an
unscrambler. The radix-2 FFT can also, but it requires additional
indexing overhead [194]. The indexing and data transfer overhead
is greatest for the WFTA because the separate preweave and
postweave sections each require their own indexing and pass through
the complete data. The shorter modules in the PFA and WFTA and
the butterflies in the radix 2 and 4 FFT's are more efficient than
the longer ones because intermediate calculations can be kept in
CPU registers rather than general memory [250]. However, the
shorter modules and radices require more passes through the data
for a given approximate length. A proper comparison will require
actual programs to be compiled and run on a particular machine.
There are many open questions about the relationship of algorithms
and hardware architecture.



Chapter 11 

Implementing FFTs in 
Practice 



by Steven G. Johnson (Department of Mathematics, Massachusetts 
Institute of Technology) and Matteo Frigo (Cilk Arts, Inc.) 

11.1 Introduction 

Although there are a wide range of fast Fourier transform (FFT) 
algorithms, involving a wealth of mathematics from number the- 
ory to polynomial algebras, the vast majority of FFT implementa- 
tions in practice employ some variation on the Cooley-Tukey al- 
gorithm [92]. The Cooley-Tukey algorithm can be derived in two 
or three lines of elementary algebra. It can be implemented almost 
as easily, especially if only power-of-two sizes are desired; numer- 
ous popular textbooks list short FFT subroutines for power-of-two 
sizes, written in the language du jour. The implementation of the 
Cooley-Tukey algorithm, at least, would therefore seem to be a 
long-solved problem. In this chapter, however, we will argue that 
matters are not as straightforward as they might appear. 



This content is available online at <http://cnx.org/content/m16336/1.15/>.

For many years, the primary route to improving upon the
Cooley-Tukey FFT seemed to be reductions in the count of arithmetic
operations, which often dominated the execution time prior to the
ubiquity of fast floating-point hardware (at least on non-embedded
processors). Therefore, great effort was expended towards finding
new algorithms with reduced arithmetic counts [114], from
Winograd's method to achieve $\Theta(n)$ multiplications (see
footnote 2) at the cost of many more additions [411], [180], [116],
[114] to the split-radix variant on Cooley-Tukey that long achieved
the lowest known total count of additions and multiplications for
power-of-two sizes [422], [107], [391], [230], [114] (but was
recently improved upon [202], [225]). The question of the minimum
possible arithmetic count continues to be of fundamental
theoretical interest: it is not even known whether better than
$\Theta(n \log n)$ complexity is possible, since $\Omega(n \log n)$
lower bounds on the count of additions have only been proven
subject to restrictive assumptions about the algorithms [248],
[280], [281]. Nevertheless, the difference in the number of
arithmetic operations, for power-of-two sizes n, between the 1965
radix-2 Cooley-Tukey algorithm ($\sim 5n \log_2 n$ [92]) and the
currently lowest-known arithmetic count ($\sim \frac{34}{9} n \log_2 n$
[202], [225]) remains only about 25%.



2 We employ the standard asymptotic notation of $O$ for asymptotic
upper bounds, $\Theta$ for asymptotic tight bounds, and $\Omega$ for
asymptotic lower bounds [210].

Figure 11.1: The ratio of speed (1/time) between a highly
optimized FFT (FFTW 3.1.2 [133], [134]) and a typical textbook
radix-2 implementation (Numerical Recipes in C [290]) on a
3 GHz Intel Core Duo with the Intel C compiler 9.1.043, for
single-precision complex-data DFTs of size n, plotted versus
$\log_2 n$. Top line (squares) shows FFTW with SSE SIMD
instructions enabled, which perform multiple arithmetic
operations at once (see section ); bottom line (circles) shows FFTW
with SSE disabled, which thus requires a similar number of
arithmetic instructions to the textbook code. (This is not
intended as a criticism of Numerical Recipes, since simple radix-2
implementations are reasonable for pedagogy, but it illustrates
the radical differences between straightforward and optimized
implementations of FFT algorithms, even with similar arithmetic
costs.) For $n > 2^{19}$, the ratio increases because the textbook
code becomes much slower (this happens when the DFT size exceeds
the level-2 cache).



And yet there is a vast gap between this basic mathematical theory
and the actual practice: highly optimized FFT packages are often
an order of magnitude faster than the textbook subroutines, and the
internal structure to achieve this performance is radically different
from the typical textbook presentation of the "same" Cooley-Tukey
algorithm. For example, Figure 11.1 plots the ratio of benchmark
speeds between a highly optimized FFT [133], [134] and a typical
textbook radix-2 implementation [290], and the former is faster
by a factor of 5-40 (with a larger ratio as n grows). Here, we
will consider some of the reasons for this discrepancy, and some
techniques that can be used to address the difficulties faced by a
practical high-performance FFT implementation. 3

In particular, in this chapter we will discuss some of the lessons 
learned and the strategies adopted in the FFTW library. FFTW 
[133], [134] is a widely used free-software library that computes 
the discrete Fourier transform (DFT) and its various special cases. 
Its performance is competitive even with manufacturer-optimized 
programs [134], and this performance is portable thanks to the
structure of the algorithms employed, self-optimization techniques,
and highly optimized kernels (FFTW's codelets) generated by a 
special-purpose compiler. 

3 We won't address the question of parallelization on
multi-processor machines, which adds even greater difficulty to FFT
implementation. Although multi-processors are increasingly
important, achieving good serial performance is a basic
prerequisite for optimized parallel code, and is already hard
enough!

This chapter is structured as follows. First, in "Review of the
Cooley-Tukey FFT" (Section 11.2: Review of the Cooley-Tukey FFT),
we briefly review the basic ideas behind the Cooley-Tukey
algorithm and define some common terminology, especially focusing
on the many degrees of freedom that the abstract algorithm allows
to implementations. Next, in "Goals and Background of the FFTW
Project" (Section 11.3: Goals and Background of the FFTW Project),
we provide some context for FFTW's development and stress that
performance, while it receives the most publicity, is not
necessarily the most important consideration in the implementation
of a library of this sort. Third, in "FFTs and the Memory
Hierarchy" (Section 11.4: FFTs and the Memory Hierarchy), we
consider a basic theoretical model of the computer memory
hierarchy and its impact on FFT algorithm choices: quite general
considerations push implementations towards large radices and
explicitly recursive structure. Unfortunately, general
considerations are not sufficient in themselves, so we will explain
in "Adaptive Composition of FFT Algorithms" (Section 11.5: Adaptive
Composition of FFT Algorithms) how FFTW self-optimizes for
particular machines by selecting its algorithm at runtime from a
composition of simple algorithmic steps. Furthermore, "Generating
Small FFT Kernels" (Section 11.6: Generating Small FFT Kernels)
describes the utility and the principles of automatic code
generation used to produce the highly optimized building blocks of
this composition, FFTW's codelets. Finally, we will briefly
consider an important non-performance issue, in "Numerical
Accuracy in FFTs" (Section 11.7: Numerical Accuracy in FFTs).

11.2 Review of the Cooley-Tukey FFT 

The (forward, one-dimensional) discrete Fourier transform (DFT)
of an array X of n complex numbers is the array Y given by

$$Y[k] = \sum_{\ell=0}^{n-1} X[\ell] \, \omega_n^{\ell k}, \tag{11.1}$$

where $0 \le k < n$ and $\omega_n = \exp(-2\pi i/n)$ is a primitive
root of unity. Implemented directly, (11.1) would require
$\Theta(n^2)$ operations; fast Fourier transforms are
$O(n \log n)$ algorithms to compute the same result. The most
important FFT (and the one primarily used in FFTW) is known as the
"Cooley-Tukey" algorithm, after the two authors who rediscovered
and popularized it in 1965 [92], although it had been previously
known as early as 1805 by Gauss as well as by later re-inventors
[173]. The basic idea behind this FFT is that a DFT of a composite
size $n = n_1 n_2$ can be re-expressed in terms of smaller DFTs of
sizes $n_1$ and $n_2$: essentially, as a two-dimensional DFT of
size $n_1 \times n_2$ where the output is transposed. The choices
of factorizations of n, combined with the many different ways to
implement the data re-orderings of the transpositions, have led to
numerous implementation strategies for the Cooley-Tukey FFT, with
many variants distinguished by their own names [114], [389]. FFTW
implements a space of many such variants, as described in
"Adaptive Composition of FFT Algorithms" (Section 11.5: Adaptive
Composition of FFT Algorithms), but here we derive the basic
algorithm, identify its key features, and outline some important
historical variations and their relation to FFTW.

The Cooley-Tukey algorithm can be derived as follows. If n can
be factored into $n = n_1 n_2$, (11.1) can be rewritten by letting
$\ell = \ell_1 n_2 + \ell_2$ and $k = k_1 + k_2 n_1$. We then have:

$$Y[k_1 + k_2 n_1] = \sum_{\ell_2=0}^{n_2-1} \left[ \left( \sum_{\ell_1=0}^{n_1-1} X[\ell_1 n_2 + \ell_2] \, \omega_{n_1}^{\ell_1 k_1} \right) \omega_n^{\ell_2 k_1} \right] \omega_{n_2}^{\ell_2 k_2}, \tag{11.2}$$

where $k_{1,2} = 0, \ldots, n_{1,2} - 1$. Thus, the algorithm
computes $n_2$ DFTs of size $n_1$ (the inner sum), multiplies the
result by the so-called [139] twiddle factors
$\omega_n^{\ell_2 k_1}$, and finally computes $n_1$ DFTs of size
$n_2$ (the outer sum). This decomposition is then continued
recursively. The literature uses the term radix to describe an
$n_1$ or $n_2$ that is bounded (often constant); the small DFT of
the radix is traditionally called a butterfly.
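
The decomposition just described can be checked numerically. The
following Python sketch is not FFTW code; the function names dft
and cooley_tukey_step are made up for illustration. It performs a
single decomposition step for n = n1 * n2, evaluating the small
DFTs directly instead of recursing, so that the result can be
compared against direct evaluation of the DFT definition.

```python
import cmath


def dft(x):
    """Direct Theta(n^2) evaluation of the DFT definition."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * cmath.pi * l * k / n) for l in range(n))
            for k in range(n)]


def cooley_tukey_step(x, n1, n2):
    """One Cooley-Tukey step for n = n1 * n2 (sub-DFTs evaluated directly)."""
    n = n1 * n2
    assert len(x) == n
    y = [0j] * n
    # Inner sums: n2 DFTs of size n1 over the strided inputs x[l1*n2 + l2].
    inner = [dft(x[l2::n2]) for l2 in range(n2)]  # inner[l2][k1]
    for k1 in range(n1):
        # Multiply by the twiddle factors omega_n^(l2*k1).
        t = [inner[l2][k1] * cmath.exp(-2j * cmath.pi * l2 * k1 / n)
             for l2 in range(n2)]
        # Outer sum: one DFT of size n2 over l2, producing the outputs k2.
        outer = dft(t)
        for k2 in range(n2):
            y[k1 + k2 * n1] = outer[k2]  # transposed output index
    return y
```

Note how the inputs of each inner DFT are strided by n2 while the
outputs are written with stride n1; this is the transposition
referred to in the text.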

Many well-known variations are distinguished by the radix alone.
A decimation in time (DIT) algorithm uses $n_2$ as the radix, while
a decimation in frequency (DIF) algorithm uses $n_1$ as the radix.
If multiple radices are used, e.g. for n composite but not a prime
power, the algorithm is called mixed radix. A peculiar blending of
radix 2 and 4 is called split radix, which was proposed to minimize
the count of arithmetic operations [422], [107], [391], [230], [114],
although it has been superseded in this regard [202], [225]. FFTW
implements both DIT and DIF, is mixed-radix with radices that
are adapted to the hardware, and often uses much larger radices
(e.g. radix 32) than were once common. On the other end of the
scale, a "radix" of roughly $\sqrt{n}$ has been called a four-step
FFT algorithm (or six-step, depending on how many transposes one
performs) [14]; see "FFTs and the Memory Hierarchy" (Section 11.4:
FFTs and the Memory Hierarchy) for some theoretical and practical
discussion of this algorithm.

A key difficulty in implementing the Cooley-Tukey FFT is that the
$n_1$ dimension corresponds to discontiguous inputs $\ell_1$ in X
but contiguous outputs $k_1$ in Y, and vice-versa for $n_2$. This
is a matrix transpose for a single decomposition stage, and the
composition of all such transpositions is a (mixed-base)
digit-reversal permutation (or bit-reversal, for radix 2). The
resulting necessity of discontiguous memory access and data
re-ordering hinders efficient use of hierarchical memory
architectures (e.g., caches), so that the optimal execution order
of an FFT for given hardware is non-obvious, and various approaches
have been proposed.

Figure 11.2: Schematic of traditional breadth-first (left) vs.
recursive depth-first (right) ordering for radix-2 FFT of size
8: the computations for each nested box are completed before
doing anything else in the surrounding box. Breadth-first
computation performs all butterflies of a given size at once,
while depth-first computation completes one subtransform
entirely before moving on to the next (as in the algorithm
below).
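
To make the bit-reversal permutation mentioned above concrete, the
following short Python sketch lists it for a power-of-two size.
The helper names are hypothetical and chosen only for this
illustration; they are not taken from any FFT library.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` binary digits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)  # shift the lowest bit of i into r
        i >>= 1
    return r


def bit_reversal_permutation(n):
    """Return the list p with p[i] = bit-reversed i, for n a power of two."""
    bits = n.bit_length() - 1
    assert n == 1 << bits, "n must be a power of two"
    return [bit_reverse(i, bits) for i in range(n)]


# For n = 8 the permutation is [0, 4, 2, 6, 1, 5, 3, 7]:
# neighbouring output indices come from inputs that are far apart.
print(bit_reversal_permutation(8))
```

The scattered indices are what makes the data re-ordering
unfriendly to caches when it is done as a separate pass.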



One ordering distinction is between recursion and iteration. As
expressed above, the Cooley-Tukey algorithm could be thought of as
defining a tree of smaller and smaller DFTs, as depicted in
Figure 11.2; for example, a textbook radix-2 algorithm would divide
size n into two transforms of size n/2, which are divided into four
transforms of size n/4, and so on until a base case is reached (in
principle, size 1). This might naturally suggest a recursive
implementation in which the tree is traversed "depth-first" as in
Figure 11.2(right) and Listing 11.1: one size n/2 transform is
solved completely before processing the other one, and so on.
However, most traditional FFT implementations are non-recursive
(with rare exceptions [341]) and traverse the tree "breadth-first"
[389] as in Figure 11.2(left). In the radix-2 example, they would
perform n (trivial) size-1 transforms, then n/2 combinations into
size-2 transforms, then n/4 combinations into size-4 transforms,
and so on, thus making $\log_2 n$ passes over the whole array. In
contrast, as we discuss in "Discussion" (Section 11.5.2.6:
Discussion), FFTW employs an explicitly recursive strategy that
encompasses both depth-first and breadth-first styles, favoring
the former since it has some theoretical and practical advantages
as discussed in "FFTs and the Memory Hierarchy" (Section 11.4:
FFTs and the Memory Hierarchy).



Y[0,...,n−1] ← recfft2(n, X, ι):
    IF n = 1 THEN
        Y[0] ← X[0]
    ELSE
        Y[0,...,n/2−1] ← recfft2(n/2, X, 2ι)
        Y[n/2,...,n−1] ← recfft2(n/2, X + ι, 2ι)
        FOR k_1 = 0 TO (n/2)−1 DO
            t ← Y[k_1]
            Y[k_1] ← t + ω_n^{k_1} Y[k_1 + n/2]
            Y[k_1 + n/2] ← t − ω_n^{k_1} Y[k_1 + n/2]
        END FOR
    END IF

Listing 11.1: A depth-first recursive radix-2 DIT Cooley-Tukey
FFT to compute a DFT of a power-of-two size $n = 2^m$. The input
is an array X of length n with stride ι (i.e., the inputs are
X[ℓι] for ℓ = 0, ..., n − 1) and the output is an array Y of
length n (with stride 1), containing the DFT of X [Equation
(11.1)]. X + ι denotes the array beginning with X[ι]. This
algorithm operates out-of-place, produces in-order output, and
does not require a separate bit-reversal stage.

A second ordering distinction lies in how the digit-reversal is
performed. The classic approach is a single, separate
digit-reversal pass following or preceding the arithmetic
computations; this approach is so common and so deeply embedded
into FFT lore that many practitioners find it difficult to imagine
an FFT without an explicit bit-reversal stage. Although this pass
requires only $O(n)$ time [207], it can still be non-negligible,
especially if the data is out-of-cache; moreover, it neglects the
possibility that data reordering during the transform may improve
memory locality. Perhaps the oldest alternative is the Stockham
auto-sort FFT [367], [389], which transforms back and forth between
two arrays with each butterfly, transposing one digit each time,
and was popular to improve contiguity of access for vector
computers [372]. Alternatively, an explicitly recursive style, as
in FFTW, performs the digit-reversal implicitly at the "leaves" of
its computation when operating out-of-place (see section
"Discussion" (Section 11.5.2.6: Discussion)). A simple example of
this style, which computes in-order output using an out-of-place
radix-2 FFT without explicit bit-reversal, is shown in
Listing 11.1 [corresponding to Figure 11.2(right)]. To operate
in-place with $O(1)$ scratch storage, one can interleave small
matrix transpositions with the butterflies [195], [375], [297],
[166], and a related strategy in FFTW [134] is briefly described by
"Discussion" (Section 11.5.2.6: Discussion).

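For readers who want to run the pseudocode of Listing 11.1, here
is one possible Python transcription. It is a minimal sketch
rather than an excerpt from FFTW: Python has no pointer
arithmetic, so the "X + stride" of the listing is represented by
an explicit offset argument (an assumption of this sketch), and
the output is returned as a new list.

```python
import cmath


def recfft2(n, x, offset=0, stride=1):
    """DFT of x[offset], x[offset+stride], ... (n terms); n a power of two.

    Depth-first, out-of-place, radix-2 decimation in time.
    """
    if n == 1:
        return [x[offset]]
    half = n // 2
    # Even-indexed inputs first, then odd-indexed inputs; each half is
    # finished completely before the other one is started.
    y = (recfft2(half, x, offset, 2 * stride)
         + recfft2(half, x, offset + stride, 2 * stride))
    for k in range(half):
        t = y[k]
        w = cmath.exp(-2j * cmath.pi * k / n) * y[k + half]  # twiddle factor
        y[k] = t + w
        y[k + half] = t - w
    return y


# DFT of [1, 2, 3, 4] is [10, -2+2j, -2, -2-2j] (up to rounding).
print(recfft2(4, [1, 2, 3, 4]))
```

The output is in natural order even though no bit-reversal pass
appears anywhere: the re-ordering happens implicitly through the
doubling of the input stride at each level.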
Finally, we should mention that there are many FFTs entirely
distinct from Cooley-Tukey. Three notable such algorithms are the
prime-factor algorithm for $\gcd(n_1, n_2) = 1$ [278], along with
Rader's [309] and Bluestein's [35], [305], [278] algorithms for
prime n. FFTW implements the first two in its codelet generator for
hard-coded n (see "Generating Small FFT Kernels" (Section 11.6:
Generating Small FFT Kernels)) and the latter two for general prime
n (sections "Plans for prime sizes" (Section 11.5.2.5: Plans for
prime sizes) and "Goals and Background of the FFTW Project"
(Section 11.3: Goals and Background of the FFTW Project)). There
is also the Winograd FFT [411], [180], [116], [114], which
minimizes the number of multiplications at the expense of a large
number of additions; this trade-off is not beneficial on current
processors that have specialized hardware multipliers.

11.3 Goals and Background of the FFTW 
Project 

The FFTW project, begun in 1997 as a side project of the au- 
thors Frigo and Johnson as graduate students at MIT, has gone 
through several major revisions, and as of 2008 consists of more 
than 40,000 lines of code. It is difficult to measure the popularity 
of a free-software package, but (as of 2008) FFTW has been cited 
in over 500 academic papers, is used in hundreds of shipping free 
and proprietary software packages, and the authors have received 
over 10,000 emails from users of the software. Most of this chap- 
ter focuses on performance of FFT implementations, but FFTW 
would probably not be where it is today if that were the only con- 
sideration in its design. One of the key factors in FFTW's success 
seems to have been its flexibility in addition to its performance. In 
fact, FFTW is probably the most flexible DFT library available: 

• FFTW is written in portable C and runs well on many
architectures and operating systems.

• FFTW computes DFTs in $O(n \log n)$ time for any length n.
(Most other DFT implementations are either restricted to a
subset of sizes or they become $\Theta(n^2)$ for certain values
of n, for example when n is prime.)

• FFTW imposes no restrictions on the rank (dimensionality)
of multi-dimensional transforms. (Most other implementations
are limited to one-dimensional, or at most two- and
three-dimensional data.)

• FFTW supports multiple and/or strided DFTs; for example, to
transform a 3-component vector field or a portion of a
multi-dimensional array. (Most implementations support only a
single DFT of contiguous data.)

• FFTW supports DFTs of real data, as well as of real
symmetric/anti-symmetric data (also called discrete
cosine/sine transforms).

Our design philosophy has been to first define the most general 
reasonable functionality, and then to obtain the highest possible 
performance without sacrificing this generality. In this section, we 
offer a few thoughts about why such flexibility has proved impor- 
tant, and how it came about that FFTW was designed in this way. 

FFTW's generality is partly a consequence of the fact that the FFTW
project was started in response to the needs of a real application
for one of the authors (a spectral solver for Maxwell's equations
[204]), which from the beginning had to run on heterogeneous
hardware. Our initial application required multi-dimensional DFTs
of three-component vector fields (magnetic fields in
electromagnetism), and so right away this meant: (i)
multi-dimensional FFTs; (ii) user-accessible loops of FFTs of
discontiguous data; (iii) efficient support for non-power-of-two
sizes (the factor of eight difference between $n \times n \times n$
and $2n \times 2n \times 2n$ was too much to tolerate); and (iv)
saving a factor of two for the common real-input case was
desirable. That is, the initial requirements already encompassed
most of the features above, and nothing about this application is
particularly unusual.

Even for one-dimensional DFTs, there is a common misperception
that one should always choose power-of-two sizes if one cares
about efficiency. Thanks to FFTW's code generator (described in
"Generating Small FFT Kernels" (Section 11.6: Generating Small
FFT Kernels)), we could afford to devote equal optimization
effort to any n with small factors (2, 3, 5, and 7 are good),
instead of mostly optimizing powers of two like many
high-performance FFTs. As a result, to pick a typical example on
the 3 GHz Core Duo processor of Figure 11.1,
$n = 3600 = 2^4 \cdot 3^2 \cdot 5^2$ and
$n = 3840 = 2^8 \cdot 3 \cdot 5$ both execute faster than
$n = 4096 = 2^{12}$. (And if there are factors one particularly
cares about, one can generate code for them too.)
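
When a user is free to choose the transform size, the advice above
amounts to picking an n whose prime factors are all small. The
following Python sketch shows one way to test for this and to
round a size up accordingly. The helper names is_smooth and
next_smooth_size are hypothetical and are not part of the FFTW
interface; the default factor set (2, 3, 5, 7) is simply the one
named in the text.

```python
def is_smooth(n, primes=(2, 3, 5, 7)):
    """True if n >= 1 has no prime factors outside `primes`."""
    if n < 1:
        return False
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1


def next_smooth_size(n, primes=(2, 3, 5, 7)):
    """Smallest m >= n whose prime factors all lie in `primes`."""
    m = max(n, 1)
    while not is_smooth(m, primes):
        m += 1
    return m


# 3600, 3840 and 4096 all qualify; the prime 1021 rounds up to 1024.
print(is_smooth(3600), is_smooth(3840), is_smooth(4096))
print(next_smooth_size(1021))
```

Whether padding to such a size is acceptable depends on the
application; the sketch only addresses the choice of size, not
the performance of any particular library.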

One initially missing feature was efficient support for large prime 
sizes; the conventional wisdom was that large-prime algorithms 
were mainly of academic interest, since in real applications (in- 
cluding ours) one has enough freedom to choose a highly compos- 
ite transform size. However, the prime-size algorithms are fasci- 
nating, so we implemented Rader's O (nlogn) prime-n algorithm 
[309] purely for fun, including it in FFTW 2.0 (released in 1998) 
as a bonus feature. The response was astonishingly positive — 
even though users are (probably) never forced by their application 
to compute a prime-size DFT, it is rather inconvenient to always 
worry that collecting an unlucky number of data points will slow 
down one's analysis by a factor of a million. The prime-size algo- 
rithms are certainly slower than algorithms for nearby composite 
sizes, but in interactive data-analysis situations the difference be- 
tween 1 ms and 10 ms means little, while educating users to avoid 
large prime factors is hard. 

Another form of flexibility that deserves comment has to do with a 
purely technical aspect of computer software. FFTW's implemen- 
tation involves some unusual language choices internally (the FFT- 
kernel generator, described in "Generating Small FFT Kernels" 
(Section 1 1.6: Generating Small FFT Kernels), is written in Objec- 
tive Caml, a functional language especially suited for compiler-like 
programs), but its user-callable interface is purely in C with lowest- 
common-denominator datatypes (arrays of floating-point values). 
The advantage of this is that FFTW can be (and has been) called 
from almost any other programming language, from Java to Perl 
to Fortran 77. Similar lowest-common-denominator interfaces are 
apparent in many other popular numerical libraries, such as LA- 
PACK [10]. Language preferences arouse strong feelings, but this 
technical constraint means that modern programming dialects are
best hidden from view for a numerical library.

Ultimately, very few scientific-computing applications should have 
performance as their top priority. Flexibility is often far more im- 
portant, because one wants to be limited only by one's imagination, 
rather than by one's software, in the kinds of problems that can be 
studied. 



11.4 FFTs and the Memory Hierarchy 

There are many complexities of computer architectures that impact 
the optimization of FFT implementations, but one of the most per- 
vasive is the memory hierarchy. On any modern general-purpose 
computer, memory is arranged into a hierarchy of storage devices 
with increasing size and decreasing speed: the fastest and small- 
est memory being the CPU registers, then two or three levels of 
cache, then the main-memory RAM, then external storage such as 
hard disks. 4 Most of these levels are managed automatically by the 
hardware to hold the most-recently-used data from the next level 
in the hierarchy. 5 There are many complications, however, such as 
limited cache associativity (which means that certain locations in 
memory cannot be cached simultaneously) and cache lines (which 
optimize the cache for contiguous memory access), which are re- 
viewed in numerous textbooks on computer architectures. In this 
section, we focus on the simplest abstract principles of memory 
hierarchies in order to grasp their fundamental impact on FFTs. 



4 A hard disk is utilized by "out-of-core" FFT algorithms for very large n 
[389], but these algorithms appear to have been largely superseded in practice 
by both the gigabytes of memory now common on personal computers and, for 
extremely large n, by algorithms for distributed-memory parallel computers. 

5 This includes the registers: on current "x86" processors, the
user-visible instruction set (with a small number of floating-point
registers) is internally translated at runtime to RISC-like
"μ-ops" with a much larger number of physical rename registers
that are allocated automatically.

Because access to memory is in many cases the slowest part of the
computer, especially compared to arithmetic, one wishes to load as
much data as possible into the faster levels of the hierarchy, and
then perform as much computation as possible before going back
to the slower memory devices. This is called temporal locality: if
a given datum is used more than once, we arrange the computation
so that these usages occur as close together as possible in time.

11.4.1 Understanding FFTs with an ideal cache 

To understand temporal-locality strategies at a basic level, in this 
section we will employ an idealized model of a cache in a two-level 
memory hierarchy, as defined in [137]. This ideal cache stores Z 
data items from main memory (e.g. complex numbers for our pur- 
poses): when the processor loads a datum from memory, the access 
is quick if the datum is already in the cache (a cache hit) and slow 
otherwise (a cache miss, which requires the datum to be fetched 
into the cache). When a datum is loaded into the cache, 6 it must 
replace some other datum, and the ideal-cache model assumes that 
the optimal replacement strategy is used [20]: the new datum re- 
places the datum that will not be needed for the longest time in 
the future; in practice, this can be simulated to within a factor of 
two by replacing the least-recently used datum [137], but ideal re- 
placement is much simpler to analyze. Armed with this ideal-cache 
model, we can now understand some basic features of FFT imple- 
mentations that remain essentially true even on real cache archi- 
tectures. In particular, we want to know the cache complexity, the 
number Q (n;Z) of cache misses for an FFT of size n with an ideal 
cache of size Z, and what algorithm choices reduce this complex- 
ity. 



6 More generally, one can assume that a cache line of L consecutive
data items is loaded into the cache at once, in order to exploit
spatial locality. The ideal-cache model in this case requires that
the cache be tall: $Z = \Omega(L^2)$ [137].

First, consider a textbook radix-2 algorithm, which divides n by
2 at each stage and operates breadth-first as in Figure 11.2(left),
performing all butterflies of a given size at a time. If n > Z, then
each pass over the array incurs $\Theta(n)$ cache misses to reload
the data, and there are $\log_2 n$ passes, for
$\Theta(n \log_2 n)$ cache misses in total: no temporal locality at
all is exploited!
One traditional solution to this problem is blocking: the
computation is divided into maximal blocks that fit into the cache,
and the computations for each block are completed before moving on
to the next block. Here, a block of Z numbers can fit into the
cache 7 (not including storage for twiddle factors and so on), and
thus the natural unit of computation is a sub-FFT of size Z. Since
each of these blocks involves $\Theta(Z \log Z)$ arithmetic
operations, and there are $\Theta(n \log n)$ operations overall,
there must be $\Theta(\frac{n}{Z} \log_Z n)$ such blocks. More
explicitly, one could use a radix-Z Cooley-Tukey algorithm,
breaking n down by factors of Z [or $\Theta(Z)$] until a size Z is
reached: each stage requires n/Z blocks, and there are $\log_Z n$
stages, again giving $\Theta(\frac{n}{Z} \log_Z n)$ blocks overall.
Since each block requires Z cache misses to load it into cache, the
cache complexity $Q_b$ of such a blocked algorithm is

$$Q_b(n;Z) = \Theta(n \log_Z n). \tag{11.3}$$

In fact, this complexity is rigorously optimal for Cooley-Tukey
FFT algorithms [184], and immediately points us towards large
radices (not radix 2!) to exploit caches effectively in FFTs.
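
As a rough worked illustration of the gap between the breadth-first
radix-2 estimate and the blocked estimate, take the hypothetical
sizes $n = 2^{20}$ and $Z = 2^{10}$ (chosen only for round numbers)
and ignore all constant factors hidden in the $\Theta$ notation:

```latex
\begin{align*}
n \log_2 n &= 2^{20} \cdot 20 = 20\,n
  && \text{(breadth-first radix 2)} \\
n \log_Z n &= n \, \frac{\log_2 n}{\log_2 Z}
            = 2^{20} \cdot \frac{20}{10} = 2\,n
  && \text{(blocked, radix } Z\text{)}
\end{align*}
```

Under these assumptions the two estimates differ by a factor of
$\log_2 Z = 10$, which is the sense in which a larger cache rewards
a larger radix.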

7 Of course, $O(n)$ additional storage may be required for twiddle
factors, the output data (if the FFT is not in-place), and so on,
but these only affect the n that fits into cache by a constant
factor and hence do not impact cache-complexity analysis. We won't
worry about such constant factors in this section.

However, there is one shortcoming of any blocked FFT algorithm:
it is cache aware, meaning that the implementation depends
explicitly on the cache size Z. The implementation must be
modified (e.g. changing the radix) to adapt to different machines
as the cache size changes. Worse, as mentioned above, actual
machines have multiple levels of cache, and to exploit these one
must perform multiple levels of blocking, each parameterized by
the corresponding cache size. In the above example, if there were a
smaller and faster cache of size z < Z, the size-Z sub-FFTs should
themselves be performed via radix-z Cooley-Tukey using blocks
of size z. And so on. There are two paths out of these difficulties:
one is self-optimization, where the implementation automatically
adapts itself to the hardware (implicitly including any cache sizes),
as described in "Adaptive Composition of FFT Algorithms"
(Section 11.5: Adaptive Composition of FFT Algorithms); the other
is to exploit cache-oblivious algorithms. FFTW employs both of
these techniques.

The goal of cache-obliviousness is to structure the algorithm so 
that it exploits the cache without having the cache size as a param- 
eter: the same code achieves the same asymptotic cache complex- 
ity regardless of the cache size Z. An optimal cache-oblivious 
algorithm achieves the optimal cache complexity (that is, in an 
asymptotic sense, ignoring constant factors). Remarkably, opti- 
mal cache-oblivious algorithms exist for many problems, such as 
matrix multiplication, sorting, transposition, and FFTs [137]. Not 
all cache-oblivious algorithms are optimal, of course — for exam- 
ple, the textbook radix-2 algorithm discussed above is "pessimal" 
cache-oblivious (its cache complexity is independent of Z because 
it always achieves the worst case!). 

For instance, Figure 11.2(right) and Listing 11.1 show a way to
obliviously exploit the cache with a radix-2 Cooley-Tukey
algorithm, by ordering the computation depth-first rather than
breadth-first. That is, the DFT of size n is divided into two
DFTs of size n/2, and one DFT of size n/2 is completely finished
before doing any computations for the second DFT of size n/2.
The two subtransforms are then combined using n/2 radix-2
butterflies, which requires a pass over the array (and hence n
cache misses if n > Z). This process is repeated recursively until
a base case (e.g. size 2) is reached. The cache complexity
$Q_2(n;Z)$ of this algorithm satisfies the recurrence

$$Q_2(n;Z) = \begin{cases} n & n \le Z \\ 2\,Q_2(n/2;Z) + \Theta(n) & \text{otherwise} \end{cases} \tag{11.4}$$

The key property is this: once the recursion reaches a size
$n \le Z$, the subtransform fits into the cache and no further
misses are incurred. The algorithm does not "know" this and
continues subdividing the problem, of course, but all of those
further subdivisions are in-cache because they are performed in the
same depth-first branch of the tree. The solution of (11.4) is

$$Q_2(n;Z) = \Theta(n \log[n/Z]). \tag{11.5}$$

This is worse than the theoretical optimum $Q_b(n;Z)$ from (11.3),
but it is cache-oblivious (Z never entered the algorithm) and
exploits at least some temporal locality. 8 On the other hand, when
it is combined with FFTW's self-optimization and larger radices in
"Adaptive Composition of FFT Algorithms" (Section 11.5: Adaptive
Composition of FFT Algorithms), this algorithm actually performs
very well until n becomes extremely large. By itself, however,
Listing 11.1 must be modified to attain adequate performance for
reasons that have nothing to do with the cache. These practical
issues are discussed further in "Cache-obliviousness in practice"
(Section 11.4.2: Cache-obliviousness in practice).
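
The contrast between breadth-first and depth-first ordering can be
observed in a small simulation. The Python sketch below is only a
toy model under several stated assumptions: the FFT is modeled
purely as the sequence of array indices touched by in-place
radix-2 butterflies (twiddle factors and arithmetic are ignored),
cache lines hold a single item, and least-recently-used
replacement stands in for the ideal replacement strategy. All
names are made up for this illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Cache of `size` items with least-recently-used replacement."""

    def __init__(self, size):
        self.size = size
        self.items = OrderedDict()
        self.misses = 0

    def access(self, addr):
        if addr in self.items:
            self.items.move_to_end(addr)     # hit: mark as most recent
        else:
            self.misses += 1                 # miss: load the item
            self.items[addr] = True
            if len(self.items) > self.size:
                self.items.popitem(last=False)  # evict least recent


def combine(cache, base, m):
    """Touch the data of the m/2 butterflies merging two size-m/2 halves."""
    half = m // 2
    for j in range(half):
        cache.access(base + j)
        cache.access(base + j + half)


def breadth_first(cache, n):
    m = 2
    while m <= n:                            # one pass per butterfly size
        for base in range(0, n, m):
            combine(cache, base, m)
        m *= 2


def depth_first(cache, base, n):
    if n == 1:
        return
    half = n // 2
    depth_first(cache, base, half)           # finish one half completely
    depth_first(cache, base + half, half)    # before starting the other
    combine(cache, base, n)


def count_misses(n, z):
    bf, df = LRUCache(z), LRUCache(z)
    breadth_first(bf, n)
    depth_first(df, 0, n)
    return bf.misses, df.misses


print(count_misses(1024, 64))  # (breadth-first misses, depth-first misses)
```

Both orderings perform exactly the same butterflies; only the
order differs. When n is at most Z the two miss counts coincide
(each item is loaded once), and for n much larger than Z the
depth-first count is the smaller of the two in this model.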

8 This advantage of depth-first recursive implementation of the
radix-2 FFT was pointed out many years ago by Singleton (where the
"cache" was core memory) [341].

There exists a different recursive FFT that is optimal
cache-oblivious, however, and that is the radix-$\sqrt{n}$
"four-step" Cooley-Tukey algorithm (again executed recursively,
depth-first) [137]. Its cache complexity $Q_o$ satisfies the
recurrence

$$Q_o(n;Z) = \begin{cases} n & n \le Z \\ 2\sqrt{n}\,Q_o(\sqrt{n};Z) + \Theta(n) & \text{otherwise} \end{cases} \tag{11.6}$$

That is, at each stage one performs $\sqrt{n}$ DFTs of size
$\sqrt{n}$ (recursively), then multiplies by the $\Theta(n)$ twiddle
factors (and does a matrix transposition to obtain in-order
output), then finally performs another $\sqrt{n}$ DFTs of size
$\sqrt{n}$. The solution of (11.6) is
$Q_o(n;Z) = \Theta(n \log_Z n)$, the same as the optimal cache
complexity (11.3)!
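
The stated solution of the radix-$\sqrt{n}$ recurrence can be made
plausible with a short calculation. The sketch below makes two
simplifying assumptions that are not in the text: the $\Theta(n)$
term is taken to be exactly n, and n is restricted to the values
$n = Z^{2^j}$ so that repeated square roots land exactly on Z.
Writing $q(n) = Q_o(n;Z)/n$:

```latex
\begin{align*}
q(n) &= \frac{2\sqrt{n}\,Q_o(\sqrt{n};Z) + n}{n}
      = 2\,q(\sqrt{n}) + 1, \qquad q(Z) = 1, \\
q\!\left(Z^{2^j}\right) + 1
     &= 2\left(q\!\left(Z^{2^{j-1}}\right) + 1\right)
      = \cdots = 2^{j}\,\bigl(q(Z) + 1\bigr) = 2^{j+1}, \\
Q_o(n;Z) &= n\left(2^{j+1} - 1\right)
          = n\,(2\log_Z n - 1) = \Theta(n \log_Z n),
\end{align*}
```

where the last line uses $2^j = \log_Z n$ for $n = Z^{2^j}$.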

These algorithms illustrate the basic features of most optimal 
cache-oblivious algorithms: they employ a recursive divide-and- 
conquer strategy to subdivide the problem until it fits into cache, at 
which point the subdivision continues but no further cache misses 
are required. Moreover, a cache-oblivious algorithm exploits all 
levels of the cache in the same way, so an optimal cache-oblivious 
algorithm exploits a multi-level cache optimally as well as a two- 
level cache [137]: the multi-level "blocking" is implicit in the re- 
cursion. 

11.4.2 Cache-obliviousness in practice 

Even though the radix-$\sqrt{n}$ algorithm is optimal
cache-oblivious, it does not follow that FFT implementation is a
solved problem. The optimality is only in an asymptotic sense,
ignoring constant factors, $O(n)$ terms, etcetera, all of which can
matter a great deal in practice. For small or moderate n, quite
different algorithms may be superior, as discussed in "Memory
strategies in FFTW" (Section 11.4.3: Memory strategies in FFTW).
Moreover, real caches are inferior to an ideal cache in several
ways. The unsurprising consequence of all this is that
cache-obliviousness, like any complexity-based algorithm property,
does not absolve one from the ordinary process of software
optimization. At best, it reduces the amount of memory/cache tuning
that one needs to perform, structuring the implementation to make
further optimization easier and more portable.

Perhaps most importantly, one needs to perform an optimization
that has almost nothing to do with the caches: the recursion must
be "coarsened" to amortize the function-call overhead and to
enable compiler optimization. For example, the simple pedagogical
code of Listing 11.1 recurses all the way down to n = 1, and hence
there are $\approx 2n$ function calls in total, so that every
data point incurs a two-function-call overhead on average.
Moreover, the compiler cannot fully exploit the large register sets
and instruction-level parallelism of modern processors with an
n = 1 function body. 9 These problems can be effectively erased,
however, simply by making the base cases larger, e.g. the recursion
could stop when n = 32 is reached, at which point a highly
optimized hard-coded FFT of that size would be executed. In FFTW,
we produced this sort of large base-case using a specialized
code-generation program described in "Generating Small FFT Kernels"
(Section 11.6: Generating Small FFT Kernels).
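
The effect of coarsening on the number of function calls is easy
to measure. In the following Python sketch (an illustration only,
with made-up names), a direct quadratic-time DFT stands in for the
hard-coded size-32 kernel described above, and list slicing is
used instead of strides for brevity; neither choice reflects how
FFTW is actually implemented.

```python
import cmath


def direct_dft(x):
    """Stand-in for a hard-coded small kernel: direct DFT evaluation."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * cmath.pi * l * k / n) for l in range(n))
            for k in range(n)]


def recfft2_coarse(x, leaf=32, counter=None):
    """Radix-2 depth-first FFT that stops recursing at size <= leaf.

    len(x) must be a power of two; counter[0] counts function calls.
    """
    if counter is not None:
        counter[0] += 1
    n = len(x)
    if n <= leaf:
        return direct_dft(x)                 # coarsened base case
    half = n // 2
    even = recfft2_coarse(x[0::2], leaf, counter)
    odd = recfft2_coarse(x[1::2], leaf, counter)
    y = [0j] * n
    for k in range(half):
        w = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        y[k] = even[k] + w
        y[k + half] = even[k] - w
    return y


x = [complex(i % 7, -(i % 3)) for i in range(256)]
calls_fine, calls_coarse = [0], [0]
recfft2_coarse(x, leaf=1, counter=calls_fine)
recfft2_coarse(x, leaf=32, counter=calls_coarse)
print(calls_fine[0], calls_coarse[0])  # 511 calls versus 15 calls
```

With leaf=1 the call count is 2n − 1, matching the estimate of
about 2n in the text; with leaf=32 it drops to 2(n/32) − 1.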

One might get the impression that there is a strict dichotomy that 
divides cache-aware and cache-oblivious algorithms, but the two 
are not mutually exclusive in practice. Given an implementa- 
tion of a cache-oblivious strategy, one can further optimize it for 
the cache characteristics of a particular machine in order to im- 
prove the constant factors. For example, one can tune the radices 
used, the transition point between the radix-√n algorithm and the
bounded-radix algorithm, or other algorithmic choices as described
in "Memory strategies in FFTW" (Section 11.4.3: Memory strategies
in FFTW). The advantage of starting cache-aware tuning with
a cache-oblivious approach is that the starting point already exploits
all levels of the cache to some extent, and one has reason to
hope that good performance on one machine will be more portable
to other architectures than for a purely cache-aware "blocking" approach.
In practice, we have found this combination to be very
successful with FFTW.

9 In principle, it might be possible for a compiler to automatically coarsen
the recursion, similar to how compilers can partially unroll loops. We are currently
unaware of any general-purpose compiler that performs this optimization,
however.

11.4.3 Memory strategies in FFTW 

The recursive cache-oblivious strategies described above form a 
useful starting point, but FFTW supplements them with a number 
of additional tricks, and also exploits cache-obliviousness in less- 
obvious forms. 



We currently find that the general radix-√n algorithm is beneficial
only when n becomes very large, on the order of 2^20 ≈ 10^6. In
practice, this means that we use at most a single step of radix-√n
(two steps would only be used for n > 2^40). The reason for this
is that the implementation of radix √n is less efficient than for a
bounded radix: the latter has the advantage that an entire radix butterfly
can be performed in hard-coded loop-free code within local
variables/registers, including the necessary permutations and twiddle
factors.

Thus, for more moderate n, FFTW uses depth-first recursion with 
a bounded radix, similar in spirit to the algorithm of p. ?? but with 
much larger radices (radix 32 is common) and base cases (size 32 
or 64 is common) as produced by the code generator of "Gener- 
ating Small FFT Kernels" (Section 11.6: Generating Small FFT 
Kernels). The self-optimization described in "Adaptive Composi- 
tion of FFT Algorithms" (Section 11.5: Adaptive Composition of 
FFT Algorithms) allows the choice of radix and the transition to 
the radix-√n algorithm to be tuned in a cache-aware (but entirely
automatic) fashion.

For small n (including the radix butterflies and the base cases of
the recursion), hard-coded FFTs (FFTW's codelets) are employed. 
However, this gives rise to an interesting problem: a codelet for
(e.g.) n = 64 is ≈ 2000 lines long, with hundreds of variables
and over 1000 arithmetic operations that can be executed in many
orders, so what order should be chosen? The key problem here
is the efficient use of the CPU registers, which essentially form a
nearly ideal, fully associative cache. Normally, one relies on the
compiler for all code scheduling and register allocation, but
the compiler needs help with such long blocks of code (indeed, the
general register-allocation problem is NP-complete). In particular,
FFTW's generator knows more about the code than the compiler:
the generator knows it is an FFT, and therefore it can use an optimal
cache-oblivious schedule (analogous to the radix-√n algorithm)
to order the code independent of the number of registers
[128]. The compiler is then used only for local "cache-aware" tuning
(both for register allocation and the CPU pipeline). 10 As a
practical matter, one consequence of this scheduler is that FFTW's 
machine-independent codelets are no slower than machine-specific 
codelets generated by an automated search and optimization over 
many possible codelet implementations, as performed by the SPI- 
RAL project [420]. 

(When implementing hard-coded base cases, there is another 
choice because a loop of small transforms is always required. Is 
it better to implement a hard-coded FFT of size 64, for example, 
or an unrolled loop of four size-16 FFTs, both of which operate
on the same amount of data? The former should be more efficient
because it performs more computations with the same amount of
data, thanks to the log n factor in the FFT's n log n complexity.)
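
As a rough worked comparison, take n log_2 n as a proxy for the arithmetic of a size-n FFT. This ignores constant factors and is only an assumption for illustration. Both options touch 64 data points:

```latex
\[
\underbrace{64\,\log_2 64}_{\text{one size-64 FFT}} = 384
\qquad\text{versus}\qquad
\underbrace{4 \cdot 16\,\log_2 16}_{\text{four size-16 FFTs}} = 256 .
\]
```

Under this proxy the single size-64 transform performs 384/256 = 1.5 times as much computation per data point loaded.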



10 One practical difficulty is that some "optimizing" compilers will tend to 
greatly re-order the code, destroying FFTW's optimal schedule. With GNU 
gcc, we circumvent this problem by using compiler flags that explicitly disable 
certain stages of the optimizer. 

In addition, there are many other techniques that FFTW employs
to supplement the basic recursive strategy, mainly to address the 
fact that cache implementations strongly favor accessing consec- 
utive data — thanks to cache lines, limited associativity, and direct 
mapping using low-order address bits (accessing data at power-of- 
two intervals in memory, which is distressingly common in FFTs, 
is thus especially prone to cache-line conflicts). Unfortunately, the 
known FFT algorithms inherently involve some non-consecutive 
access (whether mixed with the computation or in separate bit- 
reversal/transposition stages). There are many optimizations in 
FFTW to address this. For example, the data for several butter- 
flies at a time can be copied to a small buffer before computing 
and then copied back, where the copies and computations involve 
more consecutive access than doing the computation directly in- 
place. Or, the input data for the subtransform can be copied from 
(discontiguous) input to (contiguous) output before performing the 
subtransform in-place (see "Indirect plans" (Section 11.5.2.4: Indirect
plans)), rather than performing the subtransform directly out-of-place
(as in algorithm 1 (p. ??)). Or, the order of loops can
be interchanged in order to push the outermost loop from the first
radix step [the ℓ_2 loop in (11.2)] down to the leaves, in order to
make the input access more consecutive (see "Discussion" (Sec- 
tion 11.5.2.6: Discussion)). Or, the twiddle factors can be com- 
puted using a smaller look-up table (fewer memory loads) at the 
cost of more arithmetic (see "Numerical Accuracy in FFTs" (Sec- 
tion 11.7: Numerical Accuracy in FFTs)). The choice of whether 
to use any of these techniques, which come into play mainly for 
moderate n (2^13 < n < 2^20), is made by the self-optimizing planner
as described in the next section. 
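
Why power-of-two strides are troublesome can be illustrated with a toy cache model in Python. The function name `sets_touched` and the cache parameters (64 sets of 64-byte lines, set index taken from the low-order address bits) are assumptions chosen for illustration, not a description of any particular processor.

```python
def sets_touched(stride_bytes, count, line_bytes=64, num_sets=64):
    """Count the distinct cache sets hit by `count` accesses at a fixed byte stride.

    Toy model: the set index is (address // line_bytes) % num_sets,
    i.e. it is taken from the low-order bits of the line address.
    """
    return len({(i * stride_bytes // line_bytes) % num_sets
                for i in range(count)})


# 64 accesses at a 4096-byte (power-of-two) stride all map to one set,
# while padding the stride by one cache line spreads them over all 64 sets.
conflicting = sets_touched(4096, 64)
padded = sets_touched(4096 + 64, 64)
contiguous = sets_touched(16, 64)   # 16-byte complex numbers, stride 1
```

In this model the power-of-two stride uses 1 set, the padded stride uses 64, and contiguous access uses 16 (four elements share each line). This is the kind of behavior that copying data to a small contiguous buffer is meant to avoid.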

11.5 Adaptive Composition of FFT Algorithms

As alluded to several times already, FFTW implements a wide va- 
riety of FFT algorithms (mostly rearrangements of Cooley-Tukey) 
and selects the "best" algorithm for a given n automatically. In this 
section, we describe how such self-optimization is implemented, 
and especially how FFTW's algorithms are structured as a com- 
position of algorithmic fragments. These techniques in FFTW are 
described in greater detail elsewhere [134], so here we will focus 
only on the essential ideas and the motivations behind them. 

An FFT algorithm in FFTW is a composition of algorithmic steps 
called a plan. The algorithmic steps each solve a certain class of 
problems (either solving the problem directly or recursively break- 
ing it into sub-problems of the same type). The choice of plan for 
a given problem is determined by a planner that selects a compo- 
sition of steps, either by runtime measurements to pick the fastest 
algorithm, or by heuristics, or by loading a pre-computed plan. 
These three pieces: problems, algorithmic steps, and the planner, 
are discussed in the following subsections. 

11.5.1 The problem to be solved 

In early versions of FFTW, the only choice made by the planner 
was the sequence of radices [131], and so each step of the plan took 
a DFT of a given size n, possibly with discontiguous input/output, 
and reduced it (via a radix r) to DFTs of size n/r, which were 
solved recursively. That is, each step solved the following problem:
given a size n, an input pointer I, an input stride ι, an output
pointer O, and an output stride o, it computed the DFT of
I[ℓι] for 0 ≤ ℓ < n and stored the result in O[ko] for 0 ≤ k < n.
However, we soon found that we could not easily express many interesting
algorithms within this framework; for example, in-place
(I = O) FFTs that do not require a separate bit-reversal stage [195],
[375], [297], [166]. It became clear that the key issue was not the 
choice of algorithms, as we had first supposed, but the definition of 
the problem to be solved. Because only problems that can be ex- 
pressed can be solved, the representation of a problem determines 
an outer bound to the space of plans that the planner can explore, 
and therefore it ultimately constrains FFTW's performance. 

The difficulty with our initial (n, I, ι, O, o) problem definition was
that it forced each algorithmic step to address only a single DFT. In 
fact, FFTs break down DFTs into multiple smaller DFTs, and it is 
the combination of these smaller transforms that is best addressed 
by many algorithmic choices, especially to rearrange the order of 
memory accesses between the subtransforms. Therefore, we redefined
our notion of a problem in FFTW to be not a single DFT, but
rather a loop of DFTs, and in fact multiple nested loops of DFTs. 
The following sections describe some of the new algorithmic steps 
that such a problem definition enables, but first we will define the 
problem more precisely. 

DFT problems in FFTW are expressed in terms of structures called 
I/O tensors, 11 which in turn are described in terms of ancillary
structures called I/O dimensions. An I/O dimension d is a triple
d = (n, ι, o), where n is a non-negative integer called the length,
ι is an integer called the input stride, and o is an integer called
the output stride. An I/O tensor t = {d_1, d_2, ..., d_p} is a set of I/O
dimensions. The non-negative integer p = |t| is called the rank
of the I/O tensor. A DFT problem, denoted by dft(N, V, I, O),
consists of two I/O tensors N and V, and of two pointers I and
O. Informally, this describes |V| nested loops of |N|-dimensional
DFTs with input data starting at memory location I and output data
starting at O.



11 I/O tensors are unrelated to the tensor-product notation used by some other 
authors to describe FFT algorithms [389], [296]. 

For simplicity, let us consider only one-dimensional DFTs, so that
N = {(n, ι, o)} implies a DFT of length n on input data with stride
ι and output data with stride o, much like in the original FFTW as
described above. The main new feature is then the addition of zero
or more "loops" V. More formally, dft(N, {(n, ι, o)} ∪ V, I, O)
is recursively defined as a "loop" of n problems: for all 0 ≤ k <
n, do all computations in dft(N, V, I + k·ι, O + k·o). The case
of multi-dimensional DFTs is defined more precisely elsewhere 
[134], but essentially each I/O dimension in N gives one dimen- 
sion of the transform. 

We call N the size of the problem. The rank of a problem is de- 
fined to be the rank of its size (i.e., the dimensionality of the DFT). 
Similarly, we call V the vector size of the problem, and the vector 
rank of a problem is correspondingly defined to be the rank of its 
vector size. Intuitively, the vector size can be interpreted as a set 
of "loops" wrapped around a single DFT, and we therefore refer to 
a single I/O dimension of V as a vector loop. (Alternatively, one 
can view the problem as describing a DFT over a |V|-dimensional
vector space.) The problem does not specify the order of execution 
of these loops, however, and therefore FFTW is free to choose the 
fastest or most convenient order. 

11.5.1.1 DFT problem examples 

A more detailed discussion of the space of problems in FFTW can 
be found in [134], but a simple understanding can be gained by
examining a few examples demonstrating that the I/O tensor repre- 
sentation is sufficiently general to cover many situations that arise 
in practice, including some that are not usually considered to be 
instances of the DFT. 

A single one-dimensional DFT of length n, with stride-1 input
X and output Y, as in (11.1), is denoted by the problem
dft({(n, 1, 1)}, {}, X, Y) (no loops: vector-rank zero).

As a more complicated example, suppose we have an n_1 × n_2
matrix X stored as n_1 consecutive blocks of contiguous length-n_2
rows (this is called row-major format). The in-place DFT
of all the rows of this matrix would be denoted by the problem
dft({(n_2, 1, 1)}, {(n_1, n_2, n_2)}, X, X): a length-n_1 loop of size-n_2
contiguous DFTs, where each iteration of the loop offsets
its input/output data by a stride n_2. Conversely, the in-place
DFT of all the columns of this matrix would be denoted by
dft({(n_1, n_2, n_2)}, {(n_2, 1, 1)}, X, X): compared to the previous
example, N and V are swapped. In the latter case, each DFT
operates on discontiguous data, and FFTW might well choose to
interchange the loops: instead of performing a loop of DFTs computed
individually, the subtransforms themselves could act on
n_2-component vectors, as described in "The space of plans in FFTW"
(Section 11.5.2: The space of plans in FFTW). 
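
The recursive definition of a problem and the row/column examples above can be sketched in Python. This is a minimal illustration, not FFTW's data structures: I/O tensors are modeled as lists of (length, input stride, output stride) triples, pointers as integer offsets into flat lists, the function name `solve_dft` is hypothetical, and the DFT itself is the naive O(n^2) sum restricted to rank |N| = 1.

```python
import cmath


def solve_dft(N, V, I, O, inp, out):
    """Evaluate dft(N, V, I, O) by its recursive definition, for |N| == 1.

    N and V are lists of (length, input stride, output stride) triples;
    I and O are integer offsets into the flat lists inp and out.
    """
    if V:  # peel off one vector loop and recurse on the remaining loops
        (n, istride, ostride), rest = V[0], V[1:]
        for k in range(n):
            solve_dft(N, rest, I + k * istride, O + k * ostride, inp, out)
        return
    (n, istride, ostride), = N
    x = [inp[I + l * istride] for l in range(n)]  # read first: safe in place
    for k in range(n):
        out[O + k * ostride] = sum(
            v * cmath.exp(-2j * cmath.pi * l * k / n) for l, v in enumerate(x))


n1, n2 = 2, 3
X = [complex(v) for v in (1, 2, 3, 4, 5, 6)]   # 2 x 3 matrix, row-major

rows = list(X)   # in-place DFT of every row
solve_dft([(n2, 1, 1)], [(n1, n2, n2)], 0, 0, rows, rows)

cols = list(X)   # in-place DFT of every column: N and V swapped
solve_dft([(n1, n2, n2)], [(n2, 1, 1)], 0, 0, cols, cols)
```

Here `cols` ends up (to rounding error) as the matrix with rows (5, 7, 9) and (-3, -3, -3), the sums and differences of the two original rows. The sketch always runs the vector loops outermost; the problem definition itself leaves that order open.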

A size-1 DFT is simply a copy Y[0] = X[0], and here this can
also be denoted by N = {} (rank zero, a "zero-dimensional"
DFT). This allows FFTW's problems to represent many kinds
of copies and permutations of the data within the same problem
framework, which is convenient because these sorts of
operations arise frequently in FFT algorithms. For example,
to copy n consecutive numbers from I to O, one would
use the rank-zero problem dft({}, {(n, 1, 1)}, I, O). More interestingly,
the in-place transpose of an n_1 × n_2 matrix X
stored in row-major format, as described above, is denoted
by dft({}, {(n_1, n_2, 1), (n_2, 1, n_1)}, X, X) (rank zero, vector-rank
two).
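
To see that this rank-zero problem really is a transpose, one can enumerate which element moves where. The Python sketch below does so under simplifying assumptions: the names `rank0_pairs` and `apply_rank0` are hypothetical, and the in-place semantics are emulated by reading from a snapshot of the data, which a real in-place transposition would of course avoid.

```python
from itertools import product


def rank0_pairs(V, I=0, O=0):
    """Yield (input index, output index) for each element moved by dft({}, V, I, O)."""
    for ks in product(*(range(n) for n, _, _ in V)):
        src = I + sum(k * istride for k, (_, istride, _) in zip(ks, V))
        dst = O + sum(k * ostride for k, (_, _, ostride) in zip(ks, V))
        yield src, dst


def apply_rank0(V, data):
    """Apply the permutation in place, reading from a snapshot of the input."""
    snapshot = list(data)
    for src, dst in rank0_pairs(V):
        data[dst] = snapshot[src]
    return data


n1, n2 = 2, 3
X = [0, 1, 2,
     3, 4, 5]                                   # 2 x 3, row-major
T = apply_rank0([(n1, n2, 1), (n2, 1, n1)], list(X))
```

The result `T` is `[0, 3, 1, 4, 2, 5]`, i.e. the 3 × 2 row-major matrix with rows (0, 3), (1, 4), (2, 5).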

11.5.2 The space of plans in FFTW 

Here, we describe a subset of the possible plans considered by 
FFTW; while not exhaustive [134], this subset is enough to illus- 
trate the basic structure of FFTW and the necessity of including 
the vector loop(s) in the problem definition to enable several interesting
algorithms. The plans that we now describe usually perform
some simple "atomic" operation, and it may not be apparent how 
these operations fit together to actually compute DFTs, or why cer- 
tain operations are useful at all. We shall discuss those matters in 
"Discussion" (Section 11.5.2.6: Discussion). 

Roughly speaking, to solve a general DFT problem, one must per- 
form three tasks. First, one must reduce a problem of arbitrary vec- 
tor rank to a set of loops nested around a problem of vector rank 
0, i.e., a single (possibly multi-dimensional) DFT. Second, one 
must reduce the multi-dimensional DFT to a sequence of rank-1
problems, i.e., one-dimensional DFTs; for simplicity, however, we 
do not consider multi-dimensional DFTs below. Third, one must 
solve the rank-1, vector rank-0 problem by means of some DFT 
algorithm such as Cooley-Tukey. These three steps need not be 
executed in the stated order, however, and in fact, almost every 
permutation and interleaving of these three steps leads to a correct 
DFT plan. The choice of the set of plans explored by the planner is 
critical for the usability of the FFTW system: the set must be large 
enough to contain the fastest possible plans, but it must be small 
enough to keep the planning time acceptable. 

11.5.2.1 Rank-0 plans 

The rank-0 problem dft({}, V, I, O) denotes a permutation of the
input array into the output array. FFTW does not solve arbitrary 
rank-0 problems, only the following two special cases that arise in 
practice. 

• When |V| = 1 and I ≠ O, FFTW produces a plan that copies
the input array into the output array. Depending on the 
strides, the plan consists of a loop or, possibly, of a call to 
the ANSI C function memcpy, which is specialized to copy 
contiguous regions of memory. 

• When |V| = 2, I = O, and the strides denote a matrix-transposition
problem, FFTW creates a plan that transposes
the array in-place. FFTW implements the square transposition
dft({}, {(n, ι, o), (n, o, ι)}, I, O) by means of the cache-oblivious
algorithm from [137], which is fast and, in theory,
uses the cache optimally regardless of the cache size (using 
principles similar to those described in the section "FFTs and 
the Memory Hierarchy" (Section 11.4: FFTs and the Mem- 
ory Hierarchy)). A generalization of this idea is employed 
for non-square transpositions with a large common factor or 
a small difference between the dimensions, adapting algo- 
rithms from [100]. 
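
The flavor of a cache-oblivious square transposition can be conveyed by a short recursive sketch in Python. This is an illustration of the divide-and-conquer idea only, not the algorithm of [137] as implemented in FFTW: the function name `transpose_inplace` and the `cutoff` parameter (which must be at least 1) are assumptions.

```python
def transpose_inplace(a, n, r0=0, r1=None, c0=0, c1=None, cutoff=4):
    """Transpose the n x n row-major matrix `a` in place by recursive halving.

    The block [r0, r1) x [c0, c1) is split along its longer side until it is
    at most cutoff x cutoff; only pairs with row < column are swapped, so
    each off-diagonal pair is exchanged exactly once.
    """
    if r1 is None:
        r1, c1 = n, n
    if r1 - r0 <= cutoff and c1 - c0 <= cutoff:
        for i in range(r0, r1):
            for j in range(max(c0, i + 1), c1):
                a[i * n + j], a[j * n + i] = a[j * n + i], a[i * n + j]
    elif r1 - r0 >= c1 - c0:
        mid = (r0 + r1) // 2
        transpose_inplace(a, n, r0, mid, c0, c1, cutoff)
        transpose_inplace(a, n, mid, r1, c0, c1, cutoff)
    else:
        mid = (c0 + c1) // 2
        transpose_inplace(a, n, r0, r1, c0, mid, cutoff)
        transpose_inplace(a, n, r0, r1, mid, c1, cutoff)


a = list(range(49))        # 7 x 7 matrix with a[i*7 + j] = i*7 + j
transpose_inplace(a, 7)
```

Because the recursion never refers to a cache size, the same code yields blocks that fit in every level of the memory hierarchy; the small `cutoff` only coarsens the base case.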



11.5.2.2 Rank-1 plans 

Rank-1 DFT problems denote ordinary one-dimensional Fourier 
transforms. FFTW deals with most rank-1 problems as follows. 

11.5.2.2.1 Direct plans 

When the DFT rank-1 problem is "small enough" (usually, n ≤ 64),
FFTW produces a direct plan that solves the problem directly. 
These plans operate by calling a fragment of C code (a codelet) 
specialized to solve problems of one particular size, whose genera- 
tion is described in "Generating Small FFT Kernels" (Section 11.6: 
Generating Small FFT Kernels). More precisely, the codelets compute
a loop (|V| ≤ 1) of small DFTs.

11.5.2.2.2 Cooley-Tukey plans 

For problems of the form dft({(n, ι, o)}, V, I, O) where n = rm,
FFTW generates a plan that implements a radix-r Cooley-Tukey 
algorithm "Review of the Cooley-Tukey FFT" (Section 11.2: Re- 
view of the Cooley-Tukey FFT). Both decimation-in-time and 
decimation-in-frequency plans are supported, with both small fixed
radices (usually, r ≤ 64) produced by the codelet generator "Generating
Small FFT Kernels" (Section 11.6: Generating Small FFT
Kernels) and also arbitrary radices (e.g. radix-√n).

The most common case is a decimation in time (DIT) plan,
corresponding to a radix r = n_2 (and thus m = n_1) in the
notation of "Review of the Cooley-Tukey FFT"
(Section 11.2: Review of the Cooley-Tukey FFT): it first
solves dft({(m, r·ι, o)}, V ∪ {(r, ι, m·o)}, I, O), then multiplies
the output array O by the twiddle factors, and finally solves
dft({(r, m·o, m·o)}, V ∪ {(m, o, o)}, O, O). For performance, the
last two steps are not planned independently, but are fused together 
in a single "twiddle" codelet — a fragment of C code that multiplies 
its input by the twiddle factors and performs a DFT of size r, oper- 
ating in-place on O. 
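
The three stages of a DIT plan can be traced in a small Python sketch for the case V = {}. It is a sketch under assumptions rather than FFTW's implementation: the names `dft_strided` and `dit_step` are hypothetical, pointers are integer offsets into flat lists, the sub-DFTs are computed naively instead of recursively, the twiddle multiplication is kept as a separate pass rather than fused into a codelet, and the input and output arrays are assumed to be distinct.

```python
import cmath


def dft_strided(n, I, istride, O, ostride, inp, out):
    """Naive length-n DFT reading inp[I + l*istride], writing out[O + k*ostride]."""
    x = [inp[I + l * istride] for l in range(n)]
    for k in range(n):
        out[O + k * ostride] = sum(
            v * cmath.exp(-2j * cmath.pi * l * k / n) for l, v in enumerate(x))


def dit_step(n, r, I, istride, O, ostride, inp, out):
    """One radix-r DIT step for dft({(n, istride, ostride)}, {}, I, O), n = r*m."""
    m = n // r
    # Stage 1: dft({(m, r*istride, ostride)}, {(r, istride, m*ostride)}, I, O)
    for l2 in range(r):
        dft_strided(m, I + l2 * istride, r * istride,
                    O + l2 * m * ostride, ostride, inp, out)
    # Stage 2: multiply the output array by the twiddle factors w_n^(l2*k1)
    for l2 in range(r):
        for k1 in range(m):
            out[O + (l2 * m + k1) * ostride] *= cmath.exp(
                -2j * cmath.pi * l2 * k1 / n)
    # Stage 3: dft({(r, m*ostride, m*ostride)}, {(m, ostride, ostride)}, O, O)
    for k1 in range(m):
        dft_strided(r, O + k1 * ostride, m * ostride,
                    O + k1 * ostride, m * ostride, out, out)


x = [complex(i * i % 7, i) for i in range(12)]
y = [0j] * 12
dit_step(12, 3, 0, 1, 0, 1, x, y)      # n = 12 with radix r = 3, m = 4
```

Comparing `y` with a direct size-12 DFT of `x` confirms that the stride bookkeeping in the two sub-problems is consistent.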

11.5.2.3 Plans for higher vector ranks 

These plans extract a vector loop to reduce a DFT problem to a 
problem of lower vector rank, which is then solved recursively. 
Any of the vector loops of V could be extracted in this way, leading
to a number of possible plans corresponding to different loop
orderings.

Formally, to solve dft(N, V, I, O), where V = {(n, ι, o)} ∪ V_1,
FFTW generates a loop that, for all k such that 0 ≤ k < n, invokes
a plan for dft(N, V_1, I + k·ι, O + k·o).

11.5.2.4 Indirect plans 

Indirect plans transform a DFT problem that requires some data 
shuffling (or discontiguous operation) into a problem that requires 
no shuffling plus a rank-0 problem that performs the shuffling. 

Formally, to solve dft(N, V, I, O) where |N| > 0, FFTW generates
a plan that first solves dft({}, N ∪ V, I, O), and then solves
dft(copy-o(N), copy-o(V), O, O). Here we define copy-o(t)
to be the I/O tensor {(n, o, o) | (n, ι, o) ∈ t}: that is, it replaces
the input strides with the output strides. Thus, an indirect plan first 
rearranges/copies the data to the output, then solves the problem in 
place. 
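
For a single one-dimensional transform with V = {}, an indirect plan amounts to the following Python sketch. The names `dft_strided` and `indirect_plan` are hypothetical, pointers are modeled as integer offsets into flat lists, and the in-place sub-problem is solved naively instead of by a recursive plan.

```python
import cmath


def dft_strided(n, I, istride, O, ostride, inp, out):
    """Naive length-n DFT reading inp[I + l*istride], writing out[O + k*ostride]."""
    x = [inp[I + l * istride] for l in range(n)]
    for k in range(n):
        out[O + k * ostride] = sum(
            v * cmath.exp(-2j * cmath.pi * l * k / n) for l, v in enumerate(x))


def indirect_plan(n, I, istride, O, ostride, inp, out):
    """Solve dft({(n, istride, ostride)}, {}, I, O) as a copy plus an in-place DFT."""
    # Rank-0 problem dft({}, {(n, istride, ostride)}, I, O): rearrange/copy.
    for l in range(n):
        out[O + l * ostride] = inp[I + l * istride]
    # In-place problem dft({(n, ostride, ostride)}, {}, O, O): the input
    # strides have been replaced by the output strides (copy-o).
    dft_strided(n, O, ostride, O, ostride, out, out)


inp = [complex(i, 2 * i) for i in range(12)]
out = [0j] * 8
indirect_plan(4, 1, 3, 0, 2, inp, out)   # reads inp[1], inp[4], inp[7], inp[10]
```

The result matches a direct out-of-place call `dft_strided(4, 1, 3, 0, 2, inp, ref)`; the point of the indirect form is that the second step only ever touches the output array.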

11.5.2.5 Plans for prime sizes 

As discussed in "Goals and Background of the FFTW Project" 
(Section 11.3: Goals and Background of the FFTW Project), it 
turns out to be surprisingly useful to be able to handle large prime 
n (or large prime factors). Rader plans implement the algorithm 
from [309] to compute one-dimensional DFTs of prime size in 
O(n log n) time. Bluestein plans implement Bluestein's "chirp-z"
algorithm, which can also handle prime n in O(n log n) time [35],
[305], [278]. Generic plans implement a naive O(n^2) algorithm
(useful for n < 100). 
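
As an illustration of how a prime size can be reduced to power-of-two FFTs, here is a compact Python sketch of Bluestein's idea, which rewrites ℓk as (ℓ^2 + k^2 - (k - ℓ)^2)/2 so that the DFT becomes a convolution. It is a textbook-style sketch, not FFTW's Bluestein plan: the names `fft_pow2` and `bluestein` are hypothetical, and the zero-padded convolution length is simply the next power of two of at least 2n - 1.

```python
import cmath


def fft_pow2(x, sign=-1):
    """Recursive radix-2 FFT for power-of-two lengths (sign=+1: unscaled inverse)."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft_pow2(x[0::2], sign), fft_pow2(x[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def bluestein(x):
    """DFT of arbitrary length n via a length-m cyclic convolution, m = 2^j >= 2n-1."""
    n = len(x)
    m = 1
    while m < 2 * n - 1:
        m *= 2
    # Chirp w[j] = exp(-i*pi*j^2/n); j^2 is reduced mod 2n for accuracy.
    w = [cmath.exp(-1j * cmath.pi * (j * j % (2 * n)) / n) for j in range(n)]
    a = [x[j] * w[j] for j in range(n)] + [0j] * (m - n)
    b = [0j] * m
    for j in range(n):
        b[j] = w[j].conjugate()
        b[-j] = w[j].conjugate()      # negative lags wrap around cyclically
    fa, fb = fft_pow2(a), fft_pow2(b)
    conv = fft_pow2([p * q for p, q in zip(fa, fb)], sign=+1)
    return [w[k] * conv[k] / m for k in range(n)]


y = bluestein([complex(i, -i * i) for i in range(7)])   # prime size n = 7
```

The three power-of-two FFTs dominate the cost, which is how the O(n log n) bound for prime n arises in this approach.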

11.5.2.6 Discussion 

Although it may not be immediately apparent, the combination 
of the recursive rules in "The space of plans in FFTW" (Sec- 
tion 11.5.2: The space of plans in FFTW) can produce a number 
of useful algorithms. To illustrate these compositions, we discuss 
three particular issues: depth- vs. breadth-first, loop reordering, 
and in-place transforms. 

size-30 DFT, depth-first:
{ loop 3
  { size-5 direct codelet, vector size 2
    size-2 twiddle codelet, vector size 5 }
  size-3 twiddle codelet, vector size 10 }

size-30 DFT, breadth-first:
{ { loop 3
      size-5 direct codelet, vector size 2
    loop 3
      size-2 twiddle codelet, vector size 5 }
  size-3 twiddle codelet, vector size 10 }

Figure 11.3: Two possible decompositions for a size-30 DFT,
both for the arbitrary choice of DIT radices 3 then 2 then
5, and prime-size codelets. Items grouped by a "{" result
from the plan for a single sub-problem. In the depth-first
case, the vector rank was reduced to zero as per "Plans for
higher vector ranks" (Section 11.5.2.3: Plans for higher vector
ranks) before decomposing sub-problems, and vice versa
in the breadth-first case.



As discussed previously in sections "Review of the Cooley-Tukey 
FFT" (Section 11.2: Review of the Cooley-Tukey FFT) and "Un- 
derstanding FFTs with an ideal cache" (Section 11.4.1: Under- 
standing FFTs with an ideal cache), the same Cooley-Tukey de- 
composition can be executed in either traditional breadth-first or- 
der or in recursive depth-first order, where the latter has some 
theoretical cache advantages. FFTW is explicitly recursive, and 
thus it can naturally employ a depth-first order. Because its sub- 
problems contain a vector loop that can be executed in a variety of 
orders, however, FFTW can also employ breadth-first traversal. In 
particular, a 1d algorithm resembling the traditional breadth-first
Cooley-Tukey would result from applying "Cooley-Tukey plans"
(Section 11.5.2.2.2: Cooley-Tukey plans) to completely factorize 
the problem size before applying the loop rule "Plans for higher 
vector ranks" (Section 11.5.2.3: Plans for higher vector ranks) to 
reduce the vector ranks, whereas depth-first traversal would result 
from applying the loop rule before factorizing each subtransform. 
These two possibilities are illustrated by an example in Figure 11.3.

Another example of the effect of loop reordering is a style of plan 
that we sometimes call vector recursion (unrelated to "vector- 
radix" FFTs [114]). The basic idea is that, if one has a loop 
(vector-rank 1) of transforms, where the vector stride is smaller 
than the transform size, it is advantageous to push the loop towards 
the leaves of the transform decomposition, while otherwise main- 
taining recursive depth-first ordering, rather than looping "outside" 
the transform; i.e., apply the usual FFT to "vectors" rather than 
numbers. Limited forms of this idea have appeared for computing 
multiple FFTs on vector processors (where the loop in question 
maps directly to a hardware vector) [372]. For example, Cooley- 
Tukey produces a unit input-stride vector loop at the top-level DIT 
decomposition, but with a large output stride; this difference in 
strides makes it non-obvious whether vector recursion is advanta- 
geous for the sub-problem, but for large transforms we often ob- 
serve the planner to choose this possibility. 

In-place 1d transforms (with no separate bit reversal pass) can be
obtained as follows by a combination of DIT and DIF plans "Cooley-Tukey
plans" (Section 11.5.2.2.2: Cooley-Tukey plans) with transposes
"Rank-0 plans" (Section 11.5.2.1: Rank-0 plans). First, the
transform is decomposed via a radix-p DIT plan into a vector of
p transforms of size qm, then these are decomposed in turn by a
radix-q DIF plan into a vector (rank 2) of p × q transforms of size
m. These transforms of size m have input and output at different
places/strides in the original array, and so cannot be solved
independently. Instead, an indirect plan "Indirect plans" (Section 11.5.2.4:
Indirect plans) is used to express the sub-problem
as pq in-place transforms of size m, followed or preceded by an
m × p × q rank-0 transform. The latter sub-problem is easily seen to
be m in-place p × q transposes (ideally square, i.e. p = q). Related
strategies for in-place transforms based on small transposes were 
described in [195], [375], [297], [166]; alternating DIT/DIF, with- 
out concern for in-place operation, was also considered in [255], 
[322]. 

11.5.3 The FFTW planner 

Given a problem and a set of possible plans, the basic principle 
behind the FFTW planner is straightforward: construct a plan for 
each applicable algorithmic step, time the execution of these plans, 
and select the fastest one. Each algorithmic step may break the 
problem into subproblems, and the fastest plan for each subprob- 
lem is constructed in the same way. These timing measurements 
can either be performed at runtime, or alternatively the plans for a 
given set of sizes can be precomputed and loaded at a later time. 

A direct implementation of this approach, however, faces an expo- 
nential explosion of the number of possible plans, and hence of the 
planning time, as n increases. In order to reduce the planning time 
to a manageable level, we employ several heuristics to reduce the 
space of possible plans that must be compared. The most important 
of these heuristics is dynamic programming [96]: it optimizes 
each sub-problem locally, independently of the larger context (so 
that the "best" plan for a given sub-problem is re-used whenever 
that sub-problem is encountered). Dynamic programming is not 
guaranteed to find the fastest plan, because the performance of 
plans is context-dependent on real machines (e.g., the contents of 
the cache depend on the preceding computations); however, this 
approximation works reasonably well in practice and greatly re- 
duces the planning time. Other approximations, such as restric- 
tions on the types of loop-reorderings that are considered "Plans 
for higher vector ranks" (Section 11.5.2.3: Plans for higher vector
ranks), are described in [134]. 
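
The role of dynamic programming in the planner can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the names `best_plan` and `direct_cost`, the limits `MAX_DIRECT` and `MAX_RADIX`, and above all the cost model, which is a made-up formula standing in for actually running and timing each candidate plan.

```python
import math
from functools import lru_cache

MAX_DIRECT = 64   # assumed largest size solved by a direct plan
MAX_RADIX = 64    # assumed largest Cooley-Tukey radix that is tried


def direct_cost(n):
    """Made-up cost of a hard-coded size-n kernel (a real planner would time it)."""
    return n * max(1.0, math.log2(n))


@lru_cache(maxsize=None)   # memoization: each sub-size is planned only once
def best_plan(n):
    """Return (cost, plan) for a size-n DFT under the toy cost model."""
    candidates = [(float(n * n), ("generic", n))]        # naive fallback, any n
    if n <= MAX_DIRECT:
        candidates.append((direct_cost(n), ("direct", n)))
    for r in range(2, min(MAX_RADIX, n // 2) + 1):
        if n % r == 0:
            m = n // r
            sub_cost, sub_plan = best_plan(m)            # best sub-plan is re-used
            # r sub-DFTs of size m, then m twiddle steps of size r
            cost = r * sub_cost + m * (direct_cost(r) + r)
            candidates.append((cost, ("cooley-tukey", r, sub_plan)))
    return min(candidates, key=lambda c: c[0])


cost, plan = best_plan(2 ** 20)
```

The memoized `best_plan` optimizes each sub-size independently of its context, which is exactly the approximation described above. Replacing the cost formulas by real timings would give a measuring planner, while keeping a formula corresponds loosely to an estimate mode.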

Alternatively, there is an estimate mode that performs no timing 
measurements whatsoever, but instead minimizes a heuristic cost 
function. This can reduce the planner time by several orders of 
magnitude, but with a significant penalty observed in plan efficiency;
e.g., a penalty of 20% is typical for moderate n < 2^13,
whereas a factor of 2-3 can be suffered for large n > 2^16 [134].
Coming up with a better heuristic plan is an interesting open re- 
search question; one difficulty is that, because FFT algorithms de- 
pend on factorization, knowing a good plan for n does not imme- 
diately help one find a good plan for nearby n. 

11.6 Generating Small FFT Kernels 

The base cases of FFTW's recursive plans are its codelets, and 
these form a critical component of FFTW's performance. They 
consist of long blocks of highly optimized, straight-line code, im- 
plementing many special cases of the DFT that give the planner a 
large space of plans in which to optimize. Not only was it imprac- 
tical to write numerous codelets by hand, but we also needed to 
rewrite them many times in order to explore different algorithms 
and optimizations. Thus, we designed a special-purpose "FFT 
compiler" called genfft that produces the codelets automatically 
from an abstract description. genfft is summarized in this section
and described in more detail by [128]. 

A typical codelet in FFTW computes a DFT of a small, fixed size 
n (usually, n ≤ 64), possibly with the input or output multiplied by
twiddle factors "Cooley-Tukey plans" (Section 11.5.2.2.2: Cooley- 
Tukey plans). Several other kinds of codelets can be produced by 
genfft, but we will focus here on this common case.

In principle, all codelets implement some combination of the 
Cooley-Tukey algorithm from (11.2) and/or some other DFT algorithm
expressed by a similarly compact formula. However, a
high-performance implementation of the DFT must address many 
more concerns than (11.2) alone suggests. For example, (11.2) 
contains multiplications by 1 that are more efficient to omit. (11.2) 
entails a run-time factorization of n, which can be precomputed if 
n is known in advance. (11.2) operates on complex numbers, but 
breaking the complex-number abstraction into real and imaginary 
components turns out to expose certain non-obvious optimizations. 
Additionally, to exploit the long pipelines in current processors, the 
recursion implicit in (11.2) should be unrolled and re-ordered to a 
significant degree. Many further optimizations are possible if the 
complex input is known in advance to be purely real (or imagi- 
nary). Our design goal for genfft was to keep the expression of 
the DFT algorithm independent of such concerns. This separation 
allowed us to experiment with various DFT algorithms and im- 
plementation strategies independently and without (much) tedious 
rewriting. 

genfft is structured as a compiler whose input consists of the kind 
and size of the desired codelet, and whose output is C code. genfft
operates in four phases: creation, simplification, scheduling, and 
unparsing. 

In the creation phase, genfft produces a representation of the 
codelet in the form of a directed acyclic graph (dag). The dag 
is produced according to well-known DFT algorithms: Cooley- 
Tukey (11.2), prime-factor [278], split-radix [422], [107], [391], 
[230], [114], and Rader [309]. Each algorithm is expressed in a 
straightforward math-like notation, using complex numbers, with 
no attempt at optimization. Unlike a normal FFT implementation, 
however, the algorithms here are evaluated symbolically and the 
resulting symbolic expression is represented as a dag, and in par- 
ticular it can be viewed as a linear network [98] (in which the 
edges represent multiplication by constants and the vertices represent
additions of the incoming edges).

In the simplification phase, genfft applies local rewriting rules to 
each node of the dag in order to simplify it. This phase performs al- 
gebraic transformations (such as eliminating multiplications by 1) 
and common-subexpression elimination. Although such transfor- 
mations can be performed by a conventional compiler to some de- 
gree, they can be carried out here to a greater extent because genfft 
can exploit the specific problem domain. For example, two equiv- 
alent subexpressions can always be detected, even if the subex- 
pressions are written in algebraically different forms, because all 
subexpressions compute linear functions. Also, genfft can exploit 
the property that network transposition (reversing the direction of 
every edge) computes the transposed linear operation [98], in or- 
der to transpose the network, simplify, and then transpose back — 
this turns out to expose additional common subexpressions [128]. 
In total, these simplifications are sufficiently powerful to derive 
DFT algorithms specialized for real and/or symmetric data auto- 
matically from the complex algorithms. For example, it is known 
that when the input of a DFT is real (and the output is hence 
conjugate-symmetric), one can save a little over a factor of two 
in arithmetic cost by specializing FFT algorithms for this case — 
with genfft , this specialization can be done entirely automatically, 
pruning the redundant operations from the dag, to match the low- 
est known operation count for a real-input FFT starting only from 
the complex-data algorithm [128], [202]. We take advantage of 
this property to help us implement real-data DFTs [128], [134], 
to exploit machine-specific SIMD instructions (Section 11.6.1:
SIMD instructions) [134], and to generate
codelets for the discrete cosine (DCT) and sine (DST) transforms 
[128], [202]. Furthermore, by experimentation we have discov- 
ered additional simplifications that improve the speed of the gener- 
ated code. One interesting example is the elimination of negative 
constants [128]: multiplicative constants in FFT algorithms often 
come in positive/negative pairs, but every C compiler we are aware
of will generate separate load instructions for positive and negative
versions of the same constants (footnote 12). We thus obtained a 10-15%
speedup by making all constants positive, which involves propa- 
gating minus signs to change additions into subtractions or vice 
versa elsewhere in the dag (a daunting task if it had to be done 
manually for tens of thousands of lines of code). 
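The local rewriting rules described above can be illustrated with a toy expression dag. This is only a sketch under stated assumptions: the names `node`, `times`, `neg`, and `plus` are hypothetical and are not taken from genfft, and only three rules are shown (dropping multiplications by 1, sharing structurally equal subexpressions, and pushing minus signs out of constants).

```python
# Hypothetical mini expression dag; none of these names come from genfft.
_nodes = {}  # hash-consing table: structurally equal nodes are shared


def node(op, *args):
    """Return the unique node for (op, args), creating it on first use."""
    key = (op,) + args
    return _nodes.setdefault(key, key)


def var(name):
    return node("var", name)


def neg(x):
    """Negate x, cancelling a double negation."""
    if x[0] == "neg":
        return x[1]
    return node("neg", x)


def times(c, x):
    """Multiply node x by the constant c, applying local rewrite rules."""
    if c == 1:
        return x                       # eliminate multiplication by 1
    if c == 0:
        return node("const", 0.0)
    if c < 0:
        return neg(times(-c, x))       # keep stored constants positive
    return node("times", c, x)


def plus(a, b):
    """Add two nodes, turning additions of negated terms into subtractions."""
    if b[0] == "neg":
        return node("minus", a, b[1])  # a + (-b)  ->  a - b
    if a[0] == "neg":
        return node("minus", b, a[1])  # (-a) + b  ->  b - a
    a, b = sorted((a, b), key=repr)    # canonical order: a + b equals b + a
    return node("plus", a, b)


x, y = var("x"), var("y")
e1 = plus(times(0.5, x), times(1, y))
e2 = plus(y, times(0.5, x))            # same value, written differently
e3 = plus(x, times(-0.5, y))           # becomes x - 0.5*y
```

In this sketch `e1` and `e2` end up as the very same object, which is the common-subexpression sharing the text refers to, and `e3` stores only the positive constant 0.5.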

In the scheduling phase, genfft produces a topological sort of the 
dag (a schedule). The goal of this phase is to find a schedule such 
that a C compiler can subsequently perform a good register allo- 
cation. The scheduling algorithm used by genfft offers certain the- 
oretical guarantees because it has its foundations in the theory of 
cache-oblivious algorithms [137] (here, the registers are viewed as 
a form of cache), as described in "Memory strategies in FFTW"
(Section 11.4.3: Memory strategies in FFTW). As a practical matter,
one consequence of this scheduler is that FFTW's machine-independent
codelets are no slower than machine-specific codelets
generated by SPIRAL [420].
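To make the notion of a schedule concrete, the following sketch produces one topological order of a tiny, made-up dag with Python's standard `graphlib` module. It is not genfft's scheduler: the cache-oblivious ordering described above chooses among the many valid topological sorts, whereas this example only shows what a valid one looks like. The node names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dag for a size-2 butterfly: each node maps to its operands.
dag = {
    "x0": set(),
    "x1": set(),
    "sum": {"x0", "x1"},
    "diff": {"x0", "x1"},
    "y0": {"sum"},
    "y1": {"diff"},
}

# A schedule is any ordering in which operands precede their users.
schedule = list(TopologicalSorter(dag).static_order())
position = {name: i for i, name in enumerate(schedule)}
```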

In the stock genfft implementation, the schedule is finally unparsed 
to C. A variation from [127] implements the rest of a compiler back 
end and outputs assembly code. 

11.6.1 SIMD instructions 

Unfortunately, it is impossible to attain nearly peak performance 
on current popular processors while using only portable C code. 
Instead, a significant portion of the available computing power can 
only be accessed by using specialized SIMD (single-instruction 
multiple data) instructions, which perform the same operation in 
parallel on a data vector. For example, all modern "x86" processors
can execute arithmetic instructions on "vectors" of four single-precision
values (SSE instructions) or two double-precision values
(SSE2 instructions) at a time, assuming that the operands are arranged
consecutively in memory and satisfy a 16-byte alignment
constraint. Fortunately, because nearly all of FFTW's low-level
code is produced by genfft, machine-specific instructions could
be exploited by modifying the generator: the improvements are
then automatically propagated to all of FFTW's codelets, and in
particular are not limited to a small set of sizes such as powers of
two.

12 Floating-point constants must be stored explicitly in memory; they cannot
be embedded directly into the CPU instructions like integer "immediate"
constants.

SIMD instructions are superficially similar to "vector processors", 
which are designed to perform the same operation in parallel on
all elements of a data array (a "vector"). The performance of
"traditional" vector processors was best for long vectors that are stored
in contiguous memory locations, and special algorithms were de- 
veloped to implement the DFT efficiently on this kind of hardware 
[372], [166]. Unlike in vector processors, however, the SIMD vec- 
tor length is small and fixed (usually 2 or 4). Because micropro- 
cessors depend on caches for performance, one cannot naively use 
SIMD instructions to simulate a long-vector algorithm: while on 
vector machines long vectors generally yield better performance, 
the performance of a microprocessor drops as soon as the data 
vectors exceed the capacity of the cache. Consequently, SIMD 
instructions are better seen as a restricted form of instruction-level 
parallelism than as a degenerate flavor of vector parallelism, and 
different DFT algorithms are required. 

The technique used to exploit SIMD instructions in genfft is most 
easily understood for vectors of length two (e.g., SSE2). In this 
case, we view a complex DFT as a pair of real DFTs: 

$$\mathrm{DFT}(A + i \cdot B) = \mathrm{DFT}(A) + i \cdot \mathrm{DFT}(B),$$ (11.7)

where A and B are two real arrays. Our algorithm computes 
the two real DFTs in parallel using SIMD instructions, and then 
it combines the two outputs according to (11.7). This SIMD algorithm
has two important properties. First, if the data is stored
as an array of complex numbers, as opposed to two separate real
and imaginary arrays, the SIMD loads and stores always operate
on correctly-aligned contiguous locations, even if the complex
numbers themselves have a non-unit stride. Second, because the al- 
gorithm finds two-way parallelism in the real and imaginary parts 
of a single DFT (as opposed to performing two DFTs in parallel), 
we can completely parallelize DFTs of any size, not just even sizes 
or powers of 2. 
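Identity (11.7) can be checked numerically with a few lines of plain Python. This is only an illustration of the linearity argument, not of the SIMD code itself: the naive `dft` helper and the sample arrays are assumptions made for the example, and the length 5 is chosen deliberately to show that nothing depends on the size being even.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(v * cmath.exp(-2j * cmath.pi * m * k / n)
                for m, v in enumerate(x))
            for k in range(n)]


A = [1.0, 2.0, 0.5, -1.0, 3.0]   # real parts of the input
B = [0.0, -2.0, 1.5, 4.0, 1.0]   # imaginary parts of the input

# Complex DFT computed directly.
direct = dft([a + 1j * b for a, b in zip(A, B)])

# Two real-input DFTs (the pair that would share one two-wide SIMD vector) ...
FA, FB = dft(A), dft(B)
# ... combined according to (11.7).
combined = [fa + 1j * fb for fa, fb in zip(FA, FB)]

max_diff = max(abs(d - c) for d, c in zip(direct, combined))
```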

11.7 Numerical Accuracy in FFTs 

An important consideration in the implementation of any practi- 
cal numerical algorithm is numerical accuracy: how quickly do 
floating-point roundoff errors accumulate in the course of the com- 
putation? Fortunately, FFT algorithms for the most part have re- 
markably good accuracy characteristics. In particular, for a DFT 
of length n computed by a Cooley-Tukey algorithm with finite- 
precision floating-point arithmetic, the worst-case error growth is 
$O(\log n)$ [139], [373] and the mean error growth for random inputs
is only $O(\sqrt{\log n})$ [326], [373]. This is so good that, in practical
applications, a properly implemented FFT will rarely be a signifi- 
cant contributor to the numerical error. 

The amazingly small roundoff errors of FFT algorithms are sometimes
explained incorrectly as simply a consequence of the reduced
number of operations: since there are fewer operations compared
to a naive $O(n^2)$ algorithm, the argument goes, there is less
accumulation of roundoff error. The real reason, however, is more
subtle than that, and has to do with the ordering of the operations
rather than their number. For example, consider the computation
of only the output Y[0] in the radix-2 algorithm of p. ??, ignoring
all of the other outputs of the FFT. Y[0] is the sum of all of the
inputs, requiring $n-1$ additions. The FFT does not change this
requirement; it merely changes the order of the additions so as to
re-use some of them for other outputs. In particular, this radix-2
DIT FFT computes Y[0] as follows: it first sums the even-indexed
inputs, then sums the odd-indexed inputs, then adds the two sums;
the even- and odd-indexed inputs are summed recursively by the
same procedure. This process is sometimes called cascade summation,
and even though it still requires $n-1$ total additions to
compute Y[0] by itself, its roundoff error grows much more slowly
than simply adding X[0], X[1], X[2] and so on in sequence. Specifically,
the roundoff error when adding up n floating-point numbers
in sequence grows as $O(n)$ in the worst case, or as $O(\sqrt{n})$
on average for random inputs (where the errors grow according
to a random walk), but simply reordering these $n-1$ additions into
a cascade summation yields $O(\log n)$ worst-case and $O(\sqrt{\log n})$
average-case error growth [182].
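The difference between the two orderings is easy to observe. The sketch below sums the same numbers sequentially and by the even/odd cascade described above, and compares both against `math.fsum`, which returns a correctly rounded sum. The input (many copies of 0.1) is a deliberately simple assumption chosen for the illustration; the function names are hypothetical.

```python
import math


def sequential_sum(xs):
    """Add the numbers one after another, left to right."""
    total = 0.0
    for x in xs:
        total += x
    return total


def cascade_sum(xs):
    """Sum even- and odd-indexed entries recursively, then add the two sums.

    Assumes xs is non-empty. Both orderings use len(xs) - 1 additions.
    """
    if len(xs) == 1:
        return xs[0]
    return cascade_sum(xs[0::2]) + cascade_sum(xs[1::2])


xs = [0.1] * 2**16
reference = math.fsum(xs)            # correctly rounded sum of the floats
err_seq = abs(sequential_sum(xs) - reference)
err_cas = abs(cascade_sum(xs) - reference)
```

The number of additions is identical in the two functions; only their order differs, which is the point made in the text.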

However, these encouraging error-growth rates only apply if the 
trigonometric "twiddle" factors in the FFT algorithm are computed 
very accurately. Many FFT implementations, including FFTW 
and common manufacturer-optimized libraries, therefore use pre- 
computed tables of twiddle factors calculated by means of stan- 
dard library functions (which compute trigonometric constants to 
roughly machine precision). The other common method to com- 
pute twiddle factors is to use a trigonometric recurrence formula — 
this saves memory (and cache), but almost all recurrences have 
errors that grow as $O(\sqrt{n})$, $O(n)$, or even $O(n^2)$ [374], which
lead to corresponding errors in the FFT. For example, one simple
recurrence is $e^{i(k+1)\theta} = e^{ik\theta} e^{i\theta}$, multiplying
repeatedly by $e^{i\theta}$
to obtain a sequence of equally spaced angles, but the errors when
using this process grow as $O(n)$ [374]. A common improved recurrence
is $e^{i(k+1)\theta} = e^{ik\theta} + e^{ik\theta}(e^{i\theta} - 1)$,
where the small quantity (footnote 13)
$e^{i\theta} - 1 = \cos\theta - 1 + i\sin\theta$ is computed using
$\cos\theta - 1 = -2\sin^2(\theta/2)$ [341]; unfortunately, the error
using this method still
grows as $O(\sqrt{n})$ [374], far worse than logarithmic.

13 In an FFT, the twiddle factors are powers of $\omega_n$, so $\theta$ is a
small angle proportional to $1/n$ and $e^{i\theta}$ is close to 1.
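The two recurrences can be written out as follows. This is a minimal sketch for experimentation, not FFTW code: the transform size `n`, the sign convention of the angle, and the use of `cmath.exp` as the "accurate" reference are assumptions of the example.

```python
import cmath
import math

n = 4096
theta = -2 * math.pi / n
# Reference twiddles from the standard library (roughly machine precision).
reference = [cmath.exp(1j * k * theta) for k in range(n)]

# Simple recurrence: multiply repeatedly by e^{i*theta}.
step = cmath.exp(1j * theta)
simple = [1 + 0j]
for _ in range(n - 1):
    simple.append(simple[-1] * step)

# Improved recurrence: w_{k+1} = w_k + w_k * (e^{i*theta} - 1), where the
# small quantity is formed from cos(theta) - 1 = -2 sin^2(theta/2) so that
# no cancellation occurs when theta is tiny.
delta = complex(-2 * math.sin(theta / 2) ** 2, math.sin(theta))
improved = [1 + 0j]
for _ in range(n - 1):
    w = improved[-1]
    improved.append(w + w * delta)

err_simple = max(abs(a - b) for a, b in zip(simple, reference))
err_improved = max(abs(a - b) for a, b in zip(improved, reference))
```

Printing `err_simple` and `err_improved` for several values of `n` is one way to observe the growth rates quoted above.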



There are, in fact, trigonometric recurrences with the same loga- 
rithmic error growth as the FFT, but these seem more difficult to 
implement efficiently; they require that a table of $\Theta(\log n)$ values
be stored and updated as the recurrence progresses [42], [374]. In- 
stead, in order to gain at least some of the benefits of a trigono- 
metric recurrence (reduced memory pressure at the expense of 
more arithmetic), FFTW includes several ways to compute a much 
smaller twiddle table, from which the desired entries can be com- 
puted accurately on the fly using a bounded number (usually < 3) 
of complex multiplications. For example, instead of a twiddle table 
with n entries $\omega_n^k$, FFTW can use two tables with $\Theta(\sqrt{n})$ entries
each, so that $\omega_n^k$ is computed by multiplying an entry in one table
(indexed with the low-order bits of k) by an entry in the other table 
(indexed with the high-order bits of k). 
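The two-table idea can be sketched as below. This is an illustration of the indexing only, not FFTW's implementation: it assumes that n is a perfect square, that $\omega_n = e^{-2\pi i/n}$, and the names `low_table`, `high_table`, and `twiddle` are hypothetical.

```python
import cmath
import math

n = 1024                      # assumed to be a perfect square here
m = math.isqrt(n)             # each table has sqrt(n) entries
assert m * m == n


def omega(k):
    """Directly computed twiddle factor omega_n^k (assumed sign convention)."""
    return cmath.exp(-2j * cmath.pi * k / n)


low_table = [omega(k) for k in range(m)]        # exponents 0 .. m-1
high_table = [omega(m * k) for k in range(m)]   # exponents 0, m, 2m, ...


def twiddle(k):
    """omega_n^k from one complex multiplication of two table entries."""
    hi, lo = divmod(k, m)     # high-order and low-order parts of k
    return high_table[hi] * low_table[lo]


max_err = max(abs(twiddle(k) - omega(k)) for k in range(n))
```

Because every entry costs exactly one multiplication of two accurately computed values, the error does not accumulate with k, unlike in a recurrence.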

There are a few non-Cooley-Tukey algorithms that are known to 
have worse error characteristics, such as the "real-factor" algo- 
rithm [313], [114], but these are rarely used in practice (and are 
not used at all in FFTW). On the other hand, some commonly used 
algorithms for type-I and type-IV discrete cosine transforms [372], 
[290], [73] have errors that we observed to grow as $\sqrt{n}$ even for
accurate trigonometric constants (although we are not aware of any
theoretical error analysis of these algorithms), and thus we were 
forced to use alternative algorithms [134]. 

To measure the accuracy of FFTW, we compare against a slow 
FFT implemented in arbitrary-precision arithmetic, while to verify 
the correctness we have found the $O(n \log n)$ self-test algorithm of
[122] very useful.

11.8 Concluding Remarks

It is unlikely that many readers of this chapter will ever have to 
implement their own fast Fourier transform software, except as a 
learning exercise. The computation of the DFT, much like basic 
linear algebra or integration of ordinary differential equations, is 
so central to numerical computing and so well-established that ro- 
bust, flexible, highly optimized libraries are widely available, for 
the most part as free/open-source software. And yet there are many 
other problems for which the algorithms are not so finalized, or for 
which algorithms are published but the implementations are un- 
available or of poor quality. Whatever new problems one comes 
across, there is a good chance that the chasm between theory and 
efficient implementation will be just as large as it is for FFTs, un- 
less computers become much simpler in the future. For readers 
who encounter such a problem, we hope that these lessons from 
FFTW will be useful: 

• Generality and portability should almost always come first. 

• The number of operations, up to a constant factor, is less 
important than the order of operations. 

• Recursive algorithms with large base cases make optimiza- 
tion easier. 

• Optimization, like any tedious task, is best automated. 

• Code generation reconciles high-level programming with 
low-level performance. 

We should also mention one final lesson that we haven't discussed 
in this chapter: you can't optimize in a vacuum, or you end up 
congratulating yourself for making a slow program slightly faster. 
We started the FFTW project after downloading a dozen FFT im- 
plementations, benchmarking them on a few machines, and noting 
how the winners varied between machines and between transform 
sizes. Throughout FFTW's development, we continued to benefit 
from repeated benchmarks against the dozens of high-quality FFT
programs available online, without which we would have thought
FFTW was "complete" long ago.

Chapter 12

Algorithms for Data with 
Restrictions 

12.1 Algorithms for Real Data 

Many applications involve processing real data. It is inefficient to 
simply use a complex FFT on real data because arithmetic would 
be performed on the zero imaginary parts of the input, and, be- 
cause of symmetries, output values would be calculated that are 
redundant. There are several approaches to developing special al- 
gorithms or to modifying complex algorithms for real data. 

There are two methods which use a complex FFT in a special way 
to increase efficiency [39], [359]. The first method uses a length-N 
complex FFT to compute two length-N real FFTs by putting the 
two real data sequences into the real and the imaginary parts of 
the input to a complex FFT. Because transforms of real data have 
even real parts and odd imaginary parts, it is possible to separate 
the transforms of the two inputs with 2N-4 extra additions. This 
method requires, however, that two inputs be available at the same 
time. 
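The first method can be sketched as follows. The naive `dft` helper stands in for whatever complex FFT is available, and the function name `two_real_dfts` is hypothetical; the separation step uses the even/odd symmetry mentioned above, written as $X(k) = [Z(k) + Z^*(N-k)]/2$ and $Y(k) = [Z(k) - Z^*(N-k)]/(2i)$.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT standing in for a complex FFT."""
    n = len(x)
    return [sum(v * cmath.exp(-2j * cmath.pi * m * k / n)
                for m, v in enumerate(x))
            for k in range(n)]


def two_real_dfts(x, y):
    """DFTs of two real sequences from one complex DFT of x + i*y."""
    n = len(x)
    Z = dft([a + 1j * b for a, b in zip(x, y)])
    X, Y = [], []
    for k in range(n):
        Zc = Z[-k % n].conjugate()   # conj(Z[N-k]), index taken modulo N
        X.append((Z[k] + Zc) / 2)    # conjugate-symmetric part
        Y.append((Z[k] - Zc) / 2j)   # conjugate-antisymmetric part
    return X, Y


x = [1.0, 2.0, -1.0, 0.5, 3.0, 0.0]
y = [0.5, -0.5, 2.0, 1.0, -3.0, 4.0]
X, Y = two_real_dfts(x, y)
```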



1 This content is available online at <http://cnx.org/content/m16338/1.7/>.




The second method [359] uses the fact that the last stage of a 
decimation-in-time radix-2 FFT combines two independent trans- 
forms of length N/2 to compute a length-N transform. If the 
data are real, the two half length transforms are calculated by the 
method described above and the last stage is carried out to calculate 
the total length-N FFT of the real data. It should be noted that the 
half-length FFT does not have to be calculated by a radix-2 FFT. 
In fact, it should be calculated by the most efficient complex-data 
algorithm possible, such as the SRFFT or the PFA. The separation
of the two half-length transforms and the computation of the
last stage require N - 6 real multiplications and (5/2)N - 6 real
additions [359].
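A sketch of the second method, under the assumption that N is even: the even- and odd-indexed samples are packed into one length-N/2 complex sequence, the two half-length transforms are separated as in the first method, and the last radix-2 stage combines them. The naive `dft` helper is a placeholder for an efficient complex FFT, and `real_dft` is a hypothetical name. For clarity the sketch fills in all N outputs, including the redundant conjugate half.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT standing in for an efficient complex FFT."""
    n = len(x)
    return [sum(v * cmath.exp(-2j * cmath.pi * m * k / n)
                for m, v in enumerate(x))
            for k in range(n)]


def real_dft(x):
    """Length-N DFT of real data (N even) from one length-N/2 complex DFT."""
    n = len(x)
    h = n // 2
    # Even samples go to the real part, odd samples to the imaginary part.
    Z = dft([complex(x[2 * m], x[2 * m + 1]) for m in range(h)])
    X = [0j] * n
    for k in range(h):
        Zc = Z[-k % h].conjugate()
        E = (Z[k] + Zc) / 2                       # DFT of the even samples
        O = (Z[k] - Zc) / 2j                      # DFT of the odd samples
        t = cmath.exp(-2j * cmath.pi * k / n) * O
        X[k] = E + t                              # last radix-2 stage
        X[k + h] = E - t
    return X


x = [1.0, 2.0, -1.0, 0.5, 3.0, 0.0, -2.0, 4.0]
X = real_dft(x)
```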

It is possible to derive more efficient real-data algorithms directly 
rather than using a complex FFT. The basic idea, due to Bergland
[21], [22] and Sande [325], is to use the symmetries of a constant-radix
Cooley-Tukey FFT at each stage to minimize arithmetic
and storage. In the usual derivation [275] of the radix-2 FFT, the
length-N transform is written as the combination of the length-N/2 
DFT of the even indexed data and the length-N/2 DFT of the odd 
indexed data. If the input to each half-length DFT is real, the output
will have Hermitian symmetry. Hence the output of each stage
can be arranged so that the stage stores the complex
DFT with the real part located where half of the DFT would have
gone, and the imaginary part located where the conjugate would 
have gone. This removes most of the redundant calculations and 
storage but slightly complicates the addressing. The resulting but- 
terfly structure for this algorithm [359] resembles that for the fast 
Hartley transform [353]. The complete algorithm has one half the 
number of multiplications and N-2 fewer than half the additions of 
the basic complex FFT. Applying this approach to the split-radix 
FFT gives a particularly interesting algorithm [103], [359], [111]. 

Special versions of both the PFA and WFTA can also be developed 
for real data. Because the operations in the stages of the PFA can be
commuted, it is possible to move the combination of the transform
of the real part of the input and imaginary part to the last stage. Be- 
cause the imaginary part of the input is zero, half of the algorithm 
is simply omitted. This results in the number of multiplications 
required for the real transform being exactly half of that required 
for complex data and the number of additions being about N less 
than half that required for the complex case because adding a pure 
real number to a pure imaginary number does not require an actual 
addition. Unfortunately, the indexing and data transfer become
somewhat more complicated [179], [359]. A similar approach can 
be taken with the WFTA [179], [359], [284]. 

12.2 Special Algorithms for Input Data that is Mostly Zero, for
Calculating Only a Few Outputs, or where the Sampling is not Uniform

In some cases, most of the data to be transformed are zero. It is 
clearly wasteful to do arithmetic on that zero data. Another special 
case is when only a few DFT values are needed. It is likewise 
wasteful to calculate outputs that are not needed. We use a process 
called "pruning" to remove the unneeded operations. 

In other cases, the data are non-uniformly spaced samples of a
continuous-time signal [13].

12.3 Algorithms for Approximate DFTs 

There are applications where approximations to the DFT are all
that is needed [161], [163].

Chapter 13
Convolution Algorithms 1



13.1 Fast Convolution by the FFT 

One of the main applications of the FFT is to do convolution more
efficiently than the direct calculation from the definition, which is

$$y(n) = \sum_{m} h(m)\, x(n-m)$$ (13.1)

which, with a change of variables, can also be written as

$$y(n) = \sum_{m} x(m)\, h(n-m)$$ (13.2)

This is often used to filter a signal x(n) with a filter whose impulse
response is h(n). Each output value y(n) requires N multiplications
and N - 1 additions if y(n) and h(n) have N terms. So,
for N output values, on the order of $N^2$ arithmetic operations are
required.

Because the DFT converts convolution to multiplication,

$$\mathrm{DFT}\{y(n)\} = \mathrm{DFT}\{h(n)\}\, \mathrm{DFT}\{x(n)\}$$ (13.3)

y(n) can be calculated with the FFT, which brings the order of arithmetic
operations down to N log(N); this can be significant for large N.

1 This content is available online at <http://cnx.org/content/m16339/1.10/>.

This approach, which is called "fast convolution", is a form of
block processing since a whole block or segment of x(n) must be
available to calculate even one output value, y(n). So, a time delay
of one block length is always required. Another problem is that
filtering usually requires non-cyclic convolution, whereas the convolution
implemented with the DFT is cyclic. This is dealt with by
appending zeros to x(n) and h(n) such that the output of the cyclic
convolution gives one block of the output of the desired non-cyclic
convolution.
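The zero-padding argument can be checked with a small example. The sketch below is an illustration under stated assumptions: the naive `dft` and `idft` helpers replace a real FFT, and the function names are hypothetical. Both sequences are padded to the full non-cyclic output length, so the cyclic convolution computed through (13.3) cannot wrap around.

```python
import cmath


def dft(x, sign=-1):
    """Naive O(n^2) DFT (sign=-1) or unnormalized inverse (sign=+1)."""
    n = len(x)
    return [sum(v * cmath.exp(sign * 2j * cmath.pi * m * k / n)
                for m, v in enumerate(x))
            for k in range(n)]


def idft(X):
    n = len(X)
    return [v / n for v in dft(X, sign=+1)]


def direct_convolution(h, x):
    """Non-cyclic convolution straight from the definition (13.1)."""
    y = [0.0] * (len(h) + len(x) - 1)
    for m, hm in enumerate(h):
        for j, xj in enumerate(x):
            y[m + j] += hm * xj
    return y


def fft_convolution(h, x):
    """Non-cyclic convolution via a zero-padded cyclic convolution."""
    n = len(h) + len(x) - 1                      # length with no wrap-around
    H = dft(list(h) + [0.0] * (n - len(h)))
    X = dft(list(x) + [0.0] * (n - len(x)))
    y = idft([a * b for a, b in zip(H, X)])      # product of DFTs, as in (13.3)
    return [v.real for v in y]


h = [1.0, 2.0, 3.0]
x = [1.0, 0.0, -1.0, 2.0, 0.5]
y_fast = fft_convolution(h, x)
y_direct = direct_convolution(h, x)
```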

For filtering and some other applications, one wants "ongoing"
convolution where the filter response h(n) may be finite in length
or duration, but the input x(n) is of arbitrary length. Two methods
have traditionally been used to break the input into blocks and use the
FFT to convolve the blocks so that the output that would have been
calculated by directly implementing (13.1) or (13.2) can be constructed
efficiently. These are called "overlap-add" and "overlap-save".

13.1.1 Fast Convolution by Overlap-Add 

In order to use the FFT to convolve (or filter) a long input sequence
x(n) with a finite length-M impulse response, h(n), we partition
the input sequence into segments or blocks of length L. Because
convolution (or filtering) is linear, the output is a linear sum of
the result of convolving the first block with h(n) plus the result of
convolving the second block with h(n), plus the rest. Each of these
block convolutions can be calculated by using the FFT. The output
is the inverse FFT of the product of the FFT of x(n) and the FFT
of h(n). Since the number of arithmetic operations to calculate the
convolution directly is on the order of $M^2$ and, if done with the
FFT, is on the order of M log(M), there can be great savings by
using the FFT for large M.

This procedure is not totally straightforward because the output of
convolving a length-L block with a length-M filter has length
L + M - 1. This means the output blocks cannot
simply be concatenated but must be overlapped and added, hence
the name "Overlap-Add" for this algorithm.

The second issue that must be taken into account is the fact that the 
overlap-add steps need non-cyclic convolution and convolution by 
the FFT is cyclic. This is easily handled by appending L - 1 zeros
to the impulse response and M - 1 zeros to each input block so that
all FFTs are of length M + L - 1. This means there is no aliasing
and the implemented cyclic convolution gives the same output as 
the desired non-cyclic convolution. 
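The overlap-add procedure described above can be sketched as follows. This is a minimal illustration, not an optimized routine: the naive `dft` and `idft` helpers stand in for an FFT of length M + L - 1, and the function names and test data are assumptions of the example.

```python
import cmath


def dft(x, sign=-1):
    """Naive O(n^2) DFT (sign=-1) or unnormalized inverse (sign=+1)."""
    n = len(x)
    return [sum(v * cmath.exp(sign * 2j * cmath.pi * m * k / n)
                for m, v in enumerate(x))
            for k in range(n)]


def idft(X):
    n = len(X)
    return [v / n for v in dft(X, sign=+1)]


def direct_convolution(h, x):
    y = [0.0] * (len(h) + len(x) - 1)
    for m, hm in enumerate(h):
        for j, xj in enumerate(x):
            y[m + j] += hm * xj
    return y


def overlap_add(h, x, L):
    """Convolve x with the length-M filter h using input blocks of length L."""
    M = len(h)
    n = L + M - 1                               # transform length
    H = dft(list(h) + [0.0] * (L - 1))          # append L-1 zeros to h
    y = [0.0] * (len(x) + M - 1)
    for start in range(0, len(x), L):
        block = list(x[start:start + L])
        block += [0.0] * (n - len(block))       # append zeros up to length n
        yb = idft([a * b for a, b in zip(H, dft(block))])
        for i, v in enumerate(yb):              # overlap and add the tails
            if start + i < len(y):
                y[start + i] += v.real
    return y


h = [1.0, -2.0, 0.5]
x = [float(i % 7 - 3) for i in range(23)]
y_ola = overlap_add(h, x, L=8)
y_ref = direct_convolution(h, x)
```

The transform of the padded impulse response is computed once and reused for every block, which is where most of the savings come from when the input is long.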

The savings in arithmetic can be considerable when implementing 
convolution or performing FIR digital filtering. However, there are 
two penalties. The use of blocks introduces a delay of one block 
length. None of the first block of output can be calculated until 
all of the first block of input is available. This is not a problem 
for "off line" or "batch" processing but can be serious for real-time 
processing. The second penalty is the memory required to store 
and process the blocks. The continuing reduction of memory cost 
often removes this problem. 

The efficiency in terms of number of arithmetic operations per output
point increases for large blocks because of the M log(M) requirements
of the FFT. However, if the blocks become very large
(L >> M), much of the input block will be the appended zeros and
efficiency is lost. For any particular application, taking into account
the particular filter, FFT algorithm, and hardware
being used, a plot of efficiency vs. block length L should be made
and L chosen to maximize efficiency given any other constraints
that are applicable.

Usually, the block convolutions are done by the FFT, but they could
be done by any efficient, finite length method. One could use "rectangular
transforms" or "number-theoretic transforms". A generalization
of this method is presented later in the notes.

13.1.2 Fast Convolution by Overlap-Save 

An alternative approach to the Overlap-Add can be developed by
starting with segmenting the output rather than the input. If one
considers the calculation of a block of output, it is seen that not
only the corresponding input block is needed, but part of the preceding
input block is also needed. Indeed, one can show that a length
M + L - 1 segment of the input is needed for each output block.
So, one saves the last part of the preceding block and concatenates
it with the current input block, then convolves that with h(n) to
calculate the current output.
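A sketch of overlap-save along these lines is given below. It assumes, as is conventional but not stated in the text, that the first M - 1 outputs of each length-(M + L - 1) cyclic convolution are the wrapped-around ones and are discarded, leaving L valid outputs per block. The naive `dft` and `idft` helpers and all names are assumptions of the example.

```python
import cmath


def dft(x, sign=-1):
    """Naive O(n^2) DFT (sign=-1) or unnormalized inverse (sign=+1)."""
    n = len(x)
    return [sum(v * cmath.exp(sign * 2j * cmath.pi * m * k / n)
                for m, v in enumerate(x))
            for k in range(n)]


def idft(X):
    n = len(X)
    return [v / n for v in dft(X, sign=+1)]


def direct_convolution(h, x):
    y = [0.0] * (len(h) + len(x) - 1)
    for m, hm in enumerate(h):
        for j, xj in enumerate(x):
            y[m + j] += hm * xj
    return y


def overlap_save(h, x, L):
    """Convolve x with the length-M filter h, producing L outputs per block."""
    M = len(h)
    n = L + M - 1                               # input segment length
    H = dft(list(h) + [0.0] * (L - 1))
    total = len(x) + M - 1                      # length of the full output
    padded = [0.0] * (M - 1) + list(x)          # zeros "saved" before block 0
    y = []
    for start in range(0, total, L):
        seg = padded[start:start + n]           # M-1 saved samples + L new ones
        seg += [0.0] * (n - len(seg))           # zero-pad the final segment
        yb = idft([a * b for a, b in zip(H, dft(seg))])
        y.extend(v.real for v in yb[M - 1:])    # drop the wrapped outputs
    return y[:total]


h = [1.0, -2.0, 0.5]
x = [float(i % 7 - 3) for i in range(23)]
y_ols = overlap_save(h, x, L=8)
y_ref = direct_convolution(h, x)
```

Unlike overlap-add, no additions are needed to assemble the output; the valid parts of successive blocks are simply concatenated.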



13.2 Block Processing, a Generalization of 
Overlap Methods 

Convolution is intimately related to the DFT. It was shown in The 
DFT as Convolution or Filtering (Chapter 5) that a prime length 
DFT could be converted to cyclic convolution. It has long been
known [276] that convolution can be calculated by multiplying the
DFTs of signals. 

An important question is what is the fastest method for calculating 
digital convolution. There are several methods that each have some 
advantage. The earliest method for fast convolution was the use 
of sectioning with overlap-add or overlap-save and the FFT [276], 
[300], [66]. In most cases the convolution is of real data and, therefore,
real-data FFTs should be used. That approach is still probably
the fastest method for longer convolution on a general purpose
computer or microprocessor. The shorter convolutions should simply
be calculated directly.

13.3 Introduction 

The partitioning of long or infinite strings of data into shorter sec- 
tions or blocks has been used to allow application of the FFT to 
realize on-going or continuous convolution [368], [181]. This sec- 
tion develops the idea of block processing and shows that it is a 
generalization of the overlap-add and overlap-save methods [368], 
[147]. The idea is further generalized to a multidimensional formulation
of convolution [3], [47]. Moving in the opposite direction,
it is shown that, rather than partitioning a string of scalars into
blocks and then into blocks of blocks, one can partition a scalar 
number into blocks of bits and then include the operation of mul- 
tiplication in the signal processing formulation. This is called dis- 
tributed arithmetic [45] and, since it describes operations at the bit 
level, is completely general. These notes try to present a coherent 
development of these ideas. 

13.4 Block Signal Processing 

In this section the usual convolution and recursion that implement
FIR and IIR discrete-time filters are reformulated in terms
of vectors and matrices. Because the same data is partitioned and
grouped in a variety of ways, it is important to have a consistent
notation in order to be clear. The n-th element of a data sequence
is expressed h(n) or, in some cases to simplify, $h_n$. A block or
finite length column vector is denoted $\underline{h}_n$ with n indicating
the n-th block or section of a longer vector. A matrix, square or rectangular,
is indicated by an upper case letter such as H with a subscript
if appropriate.

13.4.1 Block Convolution

The operation of a finite impulse response (FIR) filter is described
by a finite convolution as

$$y(n) = \sum_{k=0}^{L-1} h(k)\, x(n-k)$$ (13.4)

where x(n) is causal, h(n) is causal and of length L, and the time
index n goes from zero to infinity or some large value. With a
change of index variables this becomes

$$y(n) = \sum_{k} h(n-k)\, x(k)$$ (13.5)

which can be expressed as a matrix operation by

$$\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \end{bmatrix} =
\begin{bmatrix} h_0 & 0 & 0 & \cdots \\ h_1 & h_0 & 0 & \\ h_2 & h_1 & h_0 & \\ \vdots & & & \ddots \end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \end{bmatrix}$$ (13.6)



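The H matrix of impulse response values is partitioned into N by
N square submatrices and the X and Y vectors are partitioned into
length-N blocks or sections. This is illustrated for N = 3 by

$$H_0 = \begin{bmatrix} h_0 & 0 & 0 \\ h_1 & h_0 & 0 \\ h_2 & h_1 & h_0 \end{bmatrix}, \quad
H_1 = \begin{bmatrix} h_3 & h_2 & h_1 \\ h_4 & h_3 & h_2 \\ h_5 & h_4 & h_3 \end{bmatrix}, \quad \text{etc.}$$ (13.7)

$$\underline{x}_0 = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix}, \quad
\underline{x}_1 = \begin{bmatrix} x_3 \\ x_4 \\ x_5 \end{bmatrix}, \quad
\underline{y}_0 = \begin{bmatrix} y_0 \\ y_1 \\ y_2 \end{bmatrix}, \quad \text{etc.}$$ (13.8)

Substituting these definitions into (13.6) gives

$$\begin{bmatrix} \underline{y}_0 \\ \underline{y}_1 \\ \underline{y}_2 \\ \vdots \end{bmatrix} =
\begin{bmatrix} H_0 & 0 & 0 & \cdots \\ H_1 & H_0 & 0 & \\ H_2 & H_1 & H_0 & \\ \vdots & & & \ddots \end{bmatrix}
\begin{bmatrix} \underline{x}_0 \\ \underline{x}_1 \\ \underline{x}_2 \\ \vdots \end{bmatrix}$$ (13.9)

The general expression for the n-th output block is

$$\underline{y}_n = \sum_{k=0}^{n} H_{n-k}\, \underline{x}_k$$ (13.10)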



which is a vector or block convolution. Since the matrix-vector 
multiplication within the block convolution is itself a convolution, 
(13.10) is a sort of convolution of convolutions and the finite length 
matrix-vector multiplication can be carried out using the FFT or 
other fast convolution methods. 
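The block convolution (13.10) can be checked against the scalar definition with a short sketch. The helper names are hypothetical, the input length is assumed to be a multiple of the block length N, and the matrix-vector products are done directly here rather than with the FFT, so the example shows only the structure, not the speed.

```python
def direct_convolution(h, x):
    y = [0.0] * (len(h) + len(x) - 1)
    for m, hm in enumerate(h):
        for j, xj in enumerate(x):
            y[m + j] += hm * xj
    return y


def h_block(h, j, N):
    """N-by-N partition H_j of the convolution matrix: entry (r, c) is h[jN + r - c]."""
    def tap(i):
        return h[i] if 0 <= i < len(h) else 0.0
    return [[tap(j * N + r - c) for c in range(N)] for r in range(N)]


def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def block_convolution(h, x, N):
    """First len(x) outputs of h * x, computed block by block as in (13.10).

    Assumes len(x) is a multiple of the block length N.
    """
    xb = [x[i:i + N] for i in range(0, len(x), N)]
    y = []
    for n in range(len(xb)):
        yn = [0.0] * N
        for k in range(n + 1):                   # sum over H_{n-k} x_k
            contrib = matvec(h_block(h, n - k, N), xb[k])
            yn = [a + b for a, b in zip(yn, contrib)]
        y.extend(yn)
    return y


h = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, -1.0, 2.0, 0.0, 3.0, 1.0, -2.0, 0.5, 4.0]
y_block = block_convolution(h, x, N=3)
y_ref = direct_convolution(h, x)[:len(x)]
```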

The equation for one output block can be written as the product

$$\underline{y}_2 = \begin{bmatrix} H_2 & H_1 & H_0 \end{bmatrix}
\begin{bmatrix} \underline{x}_0 \\ \underline{x}_1 \\ \underline{x}_2 \end{bmatrix}$$ (13.11)

and the effects of one input block can be written

(13.12)



These are generalized statements of overlap-save and overlap-add
[368], [147]. The block length can be longer, shorter, or equal to 
the filter length. 



13.4.2 Block Recursion 

Although less well-known, IIR filters can also be implemented 
with block processing [145], [74], [396], [43], [44]. The block 
form of an IIR filter is developed in much the same way as for the 
block convolution implementation of the FIR filter. The general
constant coefficient difference equation which describes an IIR filter
with recursive coefficients $a_l$, convolution coefficients $b_k$, input
signal x(n), and output signal y(n) is given by



$$y(n) = \sum_{l=1}^{N-1} a_l\, y_{n-l} + \sum_{k=0}^{M-1} b_k\, x_{n-k}$$ (13.13)

using both functional notation and subscripts, depending on which 
is easier and clearer. The impulse response h (n) is 

$$h(n) = \sum_{l=1}^{N-1} a_l\, h(n-l) + \sum_{k=0}^{M-1} b_k\, \delta(n-k)$$ (13.14)

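which can be written in matrix operator form

$$\begin{bmatrix} 1 & 0 & 0 & \cdots \\ a_1 & 1 & 0 & \\ a_2 & a_1 & 1 & \\ a_3 & a_2 & a_1 & \\ \vdots & & & \ddots \end{bmatrix}
\begin{bmatrix} h_0 \\ h_1 \\ h_2 \\ h_3 \\ \vdots \end{bmatrix} =
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ \vdots \end{bmatrix}$$ (13.15)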



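In terms of N by N submatrices and length-N blocks, this becomes

$$\begin{bmatrix} A_0 & 0 & 0 & \cdots \\ A_1 & A_0 & 0 & \\ 0 & A_1 & A_0 & \\ \vdots & & & \ddots \end{bmatrix}
\begin{bmatrix} \underline{h}_0 \\ \underline{h}_1 \\ \underline{h}_2 \\ \vdots \end{bmatrix} =
\begin{bmatrix} \underline{b}_0 \\ \underline{b}_1 \\ 0 \\ \vdots \end{bmatrix}$$ (13.16)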



From this formulation, a block recursive equation can be written
that will generate the impulse response block by block.

$$A_0\, \underline{h}_n + A_1\, \underline{h}_{n-1} = 0 \quad \text{for } n \ge 2$$ (13.17)

$$\underline{h}_n = -A_0^{-1} A_1\, \underline{h}_{n-1} = K\, \underline{h}_{n-1} \quad \text{for } n \ge 2$$ (13.18)

with initial conditions given by

$$\underline{h}_1 = -A_0^{-1} A_1 A_0^{-1}\, \underline{b}_0 + A_0^{-1}\, \underline{b}_1$$ (13.19)

This can also be written to generate the square partitions of the
impulse response matrix by

$$H_n = K H_{n-1} \quad \text{for } n \ge 2$$ (13.20)

with initial conditions given by

$$H_1 = K A_0^{-1} B_0 + A_0^{-1} B_1$$ (13.21)

and $K = -A_0^{-1} A_1$. This recursively generates square submatrices
of H similar to those defined in (13.7) and (13.9) and shows the
basic structure of the dynamic system.

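Next, we develop the recursive formulation for a general input as
described by the scalar difference equation (13.13) and in matrix
operator form by

$$\begin{bmatrix} 1 & 0 & 0 & \cdots \\ a_1 & 1 & 0 & \\ a_2 & a_1 & 1 & \\ \vdots & & & \ddots \end{bmatrix}
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \end{bmatrix} =
\begin{bmatrix} b_0 & 0 & 0 & \cdots \\ b_1 & b_0 & 0 & \\ b_2 & b_1 & b_0 & \\ \vdots & & & \ddots \end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \end{bmatrix}$$ (13.22)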



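which, after substituting the definitions of the submatrices and
assuming the block length is larger than the order of the numerator
or denominator, becomes

$$\begin{bmatrix} A_0 & 0 & 0 & \cdots \\ A_1 & A_0 & 0 & \\ 0 & A_1 & A_0 & \\ \vdots & & & \ddots \end{bmatrix}
\begin{bmatrix} \underline{y}_0 \\ \underline{y}_1 \\ \underline{y}_2 \\ \vdots \end{bmatrix} =
\begin{bmatrix} B_0 & 0 & 0 & \cdots \\ B_1 & B_0 & 0 & \\ 0 & B_1 & B_0 & \\ \vdots & & & \ddots \end{bmatrix}
\begin{bmatrix} \underline{x}_0 \\ \underline{x}_1 \\ \underline{x}_2 \\ \vdots \end{bmatrix}$$ (13.23)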



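From the partitioned rows of (13.23), one can write the block recursive
relation

$$A_0\, \underline{y}_{n+1} + A_1\, \underline{y}_n = B_0\, \underline{x}_{n+1} + B_1\, \underline{x}_n$$ (13.24)

Solving for $\underline{y}_{n+1}$ gives

$$\underline{y}_{n+1} = -A_0^{-1} A_1\, \underline{y}_n + A_0^{-1} B_0\, \underline{x}_{n+1} + A_0^{-1} B_1\, \underline{x}_n$$ (13.25)

$$\underline{y}_{n+1} = K\, \underline{y}_n + H_0\, \underline{x}_{n+1} + \tilde{H}_1\, \underline{x}_n$$ (13.26)

with $K = -A_0^{-1} A_1$, $H_0 = A_0^{-1} B_0$, and $\tilde{H}_1 = A_0^{-1} B_1$,
which is a first order vector difference equation [43], [44]. This
is the fundamental block recursive algorithm that implements the
original scalar difference equation in (13.13). It has several important
characteristics.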

• The block recursive formulation is similar to a state variable
equation but the states are blocks or sections of the output
[44], [220], [427], [428].

• The eigenvalues of K are the poles of the original scalar
problem raised to the power N plus others that are zero. The
longer the block length, the "more stable" the filter is, i.e.
the further the poles are from the unit circle [43], [44], [427],
[15], [16].

• If the block length were shorter than the denominator, the
vector difference equation would be higher than first order.
There would be a nonzero $A_2$. If the block length were
shorter than the numerator, there would be a nonzero $B_2$
and a higher order block convolution operation. If the block
length were one, the order of the vector equation would be
the same as the scalar equation. They would be the same
equation.

• The actual arithmetic that goes into the calculation of the
output is partly recursive and partly convolution. The longer
the block, the more the output is calculated by convolution
and the more arithmetic is required.

• It is possible to remove the zero eigenvalues in K by making
K rectangular or square and N by N. This results in a form
even more similar to a state variable formulation [240], [44].
This is briefly discussed below in the section on the block
state formulation (Section 13.4.3).

• There are several ways of using the FFT in the calculation
of the various matrix products in (13.25) and in (13.27) and
(13.28). Each has some arithmetic advantage for various
forms and orders of the original equation. It is also possible
to implement some of the operations using rectangular
transforms, number theoretic transforms, distributed
arithmetic, or other efficient convolution algorithms [44], [427],
[54], [48], [426], [286].

• By choosing the block length equal to the period, a periodically
time varying filter can be made block time invariant. In
other words, all the time varying characteristics are moved
to the finite matrix multiplies which leave the time invariant
properties at the block level. This allows z-transform
and other time-invariant methods to be used for stability
analysis and frequency response analysis [244], [245]. It
also turns out to be related to filter banks and multi-rate
filters [222], [221], [97].
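The block recursive algorithm (13.24) to (13.26) can be checked numerically. The following Python sketch filters a signal one block at a time and compares the result with the scalar difference equation. It is only an illustration under assumptions of ours: the function names and test values are hypothetical, the block length L is assumed to be at least the order of the numerator and denominator, and the input length is assumed to be a multiple of L. Each block is obtained by solving $A_0 y_{n+1} = -A_1 y_n + B_0 x_{n+1} + B_1 x_n$ by forward substitution, which gives the same result as applying K and the two convolution blocks of (13.26).

```python
def toeplitz_blocks(coeffs, L):
    """Diagonal block C0 and subdiagonal block C1 of the lower-triangular
    Toeplitz matrix whose (r, c) entry is coeffs[r - c]."""
    def c(k):
        return coeffs[k] if 0 <= k < len(coeffs) else 0.0
    C0 = [[c(i - j) for j in range(L)] for i in range(L)]
    C1 = [[c(L + i - j) for j in range(L)] for i in range(L)]
    return C0, C1


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def solve_lower(A0, v):
    """Forward substitution for a lower-triangular system A0 u = v."""
    u = []
    for i, row in enumerate(A0):
        u.append((v[i] - sum(row[j] * u[j] for j in range(i))) / row[i])
    return u


def block_recursive_filter(a, b, x, L):
    """Filter x in blocks of length L using the first order block recursion."""
    assert len(a) <= L + 1 and len(b) <= L + 1 and len(x) % L == 0
    A0, A1 = toeplitz_blocks(a, L)
    B0, B1 = toeplitz_blocks(b, L)
    y_prev, x_prev, y = [0.0] * L, [0.0] * L, []
    for start in range(0, len(x), L):
        x_cur = x[start:start + L]
        rhs = [p + q - r for p, q, r in
               zip(matvec(B0, x_cur), matvec(B1, x_prev), matvec(A1, y_prev))]
        y_cur = solve_lower(A0, rhs)       # one step of the block recursion
        y.extend(y_cur)
        y_prev, x_prev = y_cur, x_cur
    return y


def scalar_filter(a, b, x):
    """Reference: the scalar difference equation, sample by sample."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y


a = [1.0, -0.9, 0.2]                       # hypothetical example filter
b = [1.0, 0.5, 0.25]
x = [1.0, 2.0, -1.0, 0.0, 3.0, 1.0, 0.0, 0.0, -2.0, 4.0, 1.0, 1.0]
y_block = block_recursive_filter(a, b, x, L=4)
y_ref = scalar_filter(a, b, x)
print(max(abs(u - v) for u, v in zip(y_block, y_ref)))   # near zero
```

Changing L (for example to 3 or 6 here) changes how much of the work is done by the convolution blocks and how much by the recursion, but not the output.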



13.4.3 Block State Formulation 

It is possible to reduce the size of the matrix operators in the block 
recursive description (13.26) to give a form even more like a state 
variable equation [240], [44], [428]. If K in (13.26) has several 
zero eigenvalues, it should be possible to reduce the size of K until 
it has full rank. That was done in [44] and the result is 

$z_n = K_1 z_{n-1} + K_2 x_n$ (13.27)

$y_n = H_1 z_{n-1} + H_0 x_n$ (13.28)

where $H_0$ is the same L by L convolution matrix, $H_1$ is a
rectangular L by N partition of the convolution matrix H, $K_1$ is a square
N by N matrix of full rank, and $K_2$ is a rectangular N by L matrix.

This is now a minimal state equation whose input and output are 
blocks of the original input and output. Some of the matrix multi- 
plications can be carried out using the FFT or other techniques. 

13.4.4 Block Implementations of Digital Filters 

The advantage of the block convolution and recursion
implementations is a possible improvement in arithmetic efficiency by using
the FFT or other fast convolution methods for some of the
multiplications in (13.10) or (13.25) [246], [247]. There is also a reduction of
quantization effects due to an effective decrease in the magnitude
of the eigenvalues, and the possibility of easier parallel
implementation for IIR filters. The disadvantages are a delay of at least one
block length and an increased memory requirement.

These methods could also be used in the various filtering
methods for evaluating the DFT, namely the chirp z-transform, Rader's
method, and Goertzel's algorithm.

13.4.5 Multidimensional Formulation 

This process of partitioning the data vectors and the operator ma- 
trices can be continued by partitioning (13.10) and (13.24) and cre- 
ating blocks of blocks to give a higher dimensional structure. One 
should use index mapping ideas rather than partitioned matrices 
for this approach [3], [47]. 

13.4.6 Periodically Time-Varying Discrete-Time 
Systems 

Most time-varying systems are periodically time-varying and this
allows special results to be obtained. If the block length is set equal
to the period of the time variations, the resulting block equations
are time invariant and all of the time-varying characteristics are
contained in the matrix multiplications. This allows some of the
tools of time invariant systems to be used on periodically
time-varying systems.

The PTV system is analyzed in [425], [97], [81], [244], the fil- 
ter analysis and design problem, which includes the decimation- 
interpolation structure, is addressed in [126], [245], [222], and the 
bandwidth compression problem in [221]. These structures can 
take the form of filter banks [387]. 

13.4.7 Multirate Filters, Filter Banks, and Wavelets 

Another area that is related to periodically time varying systems
and to block processing is filter banks [387], [152]. Recently the
area of perfect reconstruction filter banks has been further
developed and shown to be closely related to wavelet based signal
analysis [97], [99], [151], [387]. The filter bank structure has several
forms with the polyphase and lattice being particularly interesting.

An idea that has some elements of multirate filters, perfect
reconstruction, and distributed arithmetic is given in [142], [140], [141].
Parks has noted that design of multirate filters has some elements
in common with complex approximation and with 2-D filter design
[337], [338] and is looking at using Tang's method for these
designs.

13.4.8 Distributed Arithmetic 

Rather than grouping the individual scalar data values in a discrete- 
time signal into blocks, the scalar values can be partitioned into 
groups of bits. Because multiplication of integers, multiplication 
of polynomials, and discrete-time convolution are the same opera- 
tions, the bit-level description of multiplication can be mixed with 
the convolution of the signal processing. The resulting structure is 
called distributed arithmetic [45], [402]. It can be used to create 
an efficient table look-up scheme to implement an FIR or IIR filter 
using no multiplications by fetching previously calculated partial 
products which are stored in a table. Distributed arithmetic, block 
processing, and multi-dimensional formulations can be combined 
into an integrated powerful description to implement digital filters 
and processors. There may be a new form of distributed arithmetic 
using the ideas in [140], [141]. 
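The table look-up idea can be sketched in a few lines of Python. This is a minimal illustration, not the scheme of [45] or [402] in detail: the function names are ours, the coefficients are assumed to be integers, and the input samples are assumed to be two's complement integers of `nbits` bits. The table holds every possible sum of coefficients; each bit plane of the input window forms an address into it, and the fetched partial products are combined with shifts and adds only (the sign bit plane is subtracted).

```python
def build_da_table(h):
    """table[addr] = sum of h[k] over the bits k that are set in addr."""
    K = len(h)
    return [sum(h[k] for k in range(K) if (addr >> k) & 1)
            for addr in range(1 << K)]


def da_inner_product(table, samples, nbits):
    """Compute sum_k h[k] * samples[k] using only table look-ups, shifts,
    and additions. samples are two's complement integers of nbits bits."""
    mask = (1 << nbits) - 1
    words = [s & mask for s in samples]       # two's complement bit patterns
    acc = 0
    for bit in range(nbits):
        addr = 0
        for k, w in enumerate(words):         # gather bit `bit` of each sample
            addr |= ((w >> bit) & 1) << k
        term = table[addr] * (1 << bit)       # a shift in hardware
        acc += -term if bit == nbits - 1 else term   # sign bit has weight -2^(nbits-1)
    return acc


def da_fir(h, x, nbits):
    """FIR filter y(n) = sum_k h[k] x(n-k) by distributed arithmetic."""
    table = build_da_table(h)
    y = []
    for n in range(len(x)):
        window = [x[n - k] if n - k >= 0 else 0 for k in range(len(h))]
        y.append(da_inner_product(table, window, nbits))
    return y


h = [3, -1, 2, 5]                 # hypothetical integer coefficients
x = [1, -2, 7, 0, -8, 4, 3]       # hypothetical 8-bit input samples
print(da_fir(h, x, nbits=8))
```

The table grows as $2^K$ for a length-K filter, which is why practical designs usually split long filters into several smaller tables.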

13.5 Direct Fast Convolution and Rectangu- 
lar Transforms 

A relatively new approach uses index mapping directly to convert a
one dimensional convolution into a multidimensional convolution
[47], [8]. This can be done by either a type-1 or type-2 map. The
short convolutions along each dimension are then done by
Winograd's optimal algorithms. Unlike for the case of the DFT, there
are no savings of arithmetic from the index mapping alone. All the
savings come from efficient short algorithms. In the case of
index mapping with convolution, the multiplications must be nested
together in the center of the algorithm in the same way as for the
WFTA. There is no equivalent to the PFA structure for
convolution. The multidimensional convolution cannot be calculated by
row and column convolutions as the DFT was by row and column
DFTs.

It would at first seem that applying the index mapping and optimal
short algorithms directly to convolution would be more efficient
than using DFTs and converting them to convolution to be
calculated by the same optimal algorithms. In practical algorithms,
however, the DFT method seems to be more efficient [286].

A method that is attractive for special purpose hardware uses dis- 
tributed arithmetic [45]. This approach uses a table look up of 
precomputed partial products to produce a system that does convo- 
lution without requiring multiplications [79]. 

Another method that requires special hardware uses number theo- 
retic transforms [31], [237], [265] to calculate convolution. These 
transforms are defined over finite fields or rings with arithmetic 
performed modulo special numbers. These transforms have rather 
limited flexibility, but when they can be used, they are very
efficient.

13.6 Number Theoretic Transforms for
Convolution

13.6.1 Results from Number Theory 

A basic review of the number theory useful for signal processing 
algorithms will be given here with specific emphasis on the congru- 
ence theory for number theoretic transforms [279], [165], [260], 
[237], [328]. 

13.6.2 Number Theoretic Transforms 

Here we look at the conditions placed on a general linear transform
in order for it to support cyclic convolution. The form of a linear
transformation of a length-N sequence of numbers is given by

$X(k) = \sum_{n=0}^{N-1} t(n,k)\, x(n)$ (13.29)

for $k = 0, 1, \ldots, N-1$. The definition of cyclic convolution of
two sequences is given by

$y(n) = \sum_{m=0}^{N-1} x(m)\, h(n-m)$ (13.30)

for $n = 0, 1, \ldots, N-1$ and all indices evaluated modulo N. We
would like to find the properties of the transformation such that it
will support the cyclic convolution. This means that if $X(k)$, $H(k)$,
and $Y(k)$ are the transforms of $x(n)$, $h(n)$, and $y(n)$ respectively,

$Y(k) = X(k)\, H(k)$. (13.31)

The conditions are derived by taking the transform defined in
(13.29) of both sides of equation (13.30) which gives

$Y(k) = \sum_{n=0}^{N-1} t(n,k) \sum_{m=0}^{N-1} x(m)\, h(n-m)$ (13.32)

$= \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} x(m)\, h(n-m)\, t(n,k)$. (13.33)

Making the change of index variables, $l = n - m$, gives

$= \sum_{m=0}^{N-1} \sum_{l=0}^{N-1} x(m)\, h(l)\, t(l+m,k)$. (13.34)

But from (13.31), this must be

$Y(k) = \sum_{n=0}^{N-1} x(n)\, t(n,k) \sum_{m=0}^{N-1} h(m)\, t(m,k)$ (13.35)

$= \sum_{m=0}^{N-1} \sum_{l=0}^{N-1} x(m)\, h(l)\, t(m,k)\, t(l,k)$. (13.36)

This must be true for all $x(n)$, $h(n)$, and $k$; therefore, from (13.34)
and (13.36) we have

$t(m+l,k) = t(m,k)\, t(l,k)$ (13.37)

For $l = 0$ we have

$t(m,k) = t(m,k)\, t(0,k)$ (13.38)

and, therefore, $t(0,k) = 1$. For $l = m$ we have

$t(2m,k) = t(m,k)\, t(m,k) = t^2(m,k)$ (13.39)

For $l = pm$ we likewise have

$t(pm,k) = t^p(m,k)$ (13.40)

and, therefore,

$t^N(m,k) = t(Nm,k) = t(0,k) = 1$. (13.41)

But

$t(m,k) = t^m(1,k) = t^k(m,1)$, (13.42)

therefore,

$t(m,k) = t^{mk}(1,1)$. (13.43)

Defining $t(1,1) = \alpha$ gives the form for our general linear
transform (13.29) as

$X(k) = \sum_{n=0}^{N-1} \alpha^{nk}\, x(n)$ (13.44)

where $\alpha$ is a root of order N, which means that N is the smallest
positive integer such that $\alpha^N = 1$.

Theorem 1 The transform (13.44) supports cyclic convolution if
and only if $\alpha$ is a root of order N and $N^{-1}$ is defined.

This is discussed in [2], [4].

Theorem 2 The transform (13.44) supports cyclic convolution if
and only if

$N \mid O(M)$ (13.45)

where

$O(M) = \gcd\{p_1 - 1, p_2 - 1, \ldots, p_l - 1\}$ (13.46)

and

$M = p_1^{r_1} p_2^{r_2} \cdots p_l^{r_l}$. (13.47)

This theorem is a more useful form of Theorem 1. Notice that
$N_{max} = O(M)$.
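The quantity $O(M)$ of (13.46) is easy to evaluate for small moduli. The following Python sketch, whose function names are ours and which uses simple trial division (adequate only for moduli with small prime factors), computes the maximum transform length for a given modulus M.

```python
from functools import reduce
from math import gcd


def prime_factors(M):
    """Distinct prime factors of M by trial division (small M only)."""
    factors = []
    p = 2
    while p * p <= M:
        if M % p == 0:
            factors.append(p)
            while M % p == 0:
                M //= p
        p += 1
    if M > 1:
        factors.append(M)
    return factors


def max_transform_length(M):
    """O(M) = gcd(p_1 - 1, ..., p_l - 1) over the prime factors of M."""
    return reduce(gcd, [p - 1 for p in prime_factors(M)])


for M in (10, 127, 2**8 + 1, 2**16 + 1, 2**32 + 1):
    print(M, max_transform_length(M))
```

For an even modulus the result is 1, and for a prime modulus it is M - 1, in agreement with the remarks that follow.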

One needs to find appropriate N, M, and $\alpha$ such that

• N should be appropriate for a fast algorithm and handle the
desired sequence lengths.

• M should allow the desired dynamic range of the signals and
should allow simple modular arithmetic.

• $\alpha$ should allow a simple multiplication for $\alpha^{nk} x(n)$.

We see that if M is even, it has a factor of 2 and, therefore,
$O(M) = N_{max} = 1$ which implies M should be odd. If M is prime
then $O(M) = M - 1$ which is as large as could be expected in a field
of M integers. For $M = 2^k - 1$, let k be a composite $k = pq$ where
p is prime. Then $2^p - 1$ divides $2^{pq} - 1$ and the maximum
possible length of the transform will be governed by the length possible
for $2^p - 1$. Therefore, only the prime k need be considered
interesting. Numbers of this form are known as Mersenne numbers and
have been used by Rader [311]. For Mersenne number transforms,
it can be shown that transforms of length at least 2p exist and the
corresponding $\alpha = -2$. Mersenne number transforms are not of as
much interest because 2p is not highly composite and, therefore,
we do not have FFT-type algorithms.
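The statement about Mersenne number transforms can be checked directly by computing the order of $\alpha$ modulo $M = 2^p - 1$. The short Python sketch below is only a numerical check for a few small primes p; the function name is ours.

```python
def multiplicative_order(alpha, M):
    """Smallest positive N with alpha**N == 1 (mod M).
    Assumes alpha and M are relatively prime."""
    alpha %= M
    value, order = alpha, 1
    while value != 1:
        value = (value * alpha) % M
        order += 1
        if order > M:
            raise ValueError("alpha has no finite order modulo M")
    return order


for p in (5, 7, 13):
    M = 2**p - 1
    print(p, M, multiplicative_order(2, M), multiplicative_order(-2, M))
```

For each p, the order of 2 comes out as p and the order of -2 as 2p, so a length-2p transform exists with $\alpha = -2$.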

For $M = 2^k + 1$ and k odd, 3 divides $2^k + 1$ and the maximum
possible transform length is 2. Thus we consider only even k. Let
$k = s 2^t$, where s is an odd integer. Then $2^{2^t} + 1$ divides $2^{s 2^t} + 1$ and
the length of the possible transform will be governed by the length
possible for $2^{2^t} + 1$. Therefore, integers of the form $M = 2^{2^t} + 1$ are
of interest. These numbers are known as Fermat numbers [311].
Fermat numbers are prime for $0 \le t \le 4$, and every Fermat number
with $t \ge 5$ whose status is known is composite.

Since Fermat numbers up to $F_4$ are prime, $O(F_t) = 2^b$ where $b = 2^t$
and we can have a Fermat number transform for any length $N = 2^m$
where $m \le b$. For these Fermat primes the integer $\alpha = 3$ is of order
$N = 2^b$ allowing the largest possible transform length. The integer
$\alpha = 2$ is of order $N = 2b = 2^{t+1}$. This is particularly attractive
since $\alpha$ to a power is multiplied times the data values in (13.44),
and multiplication by a power of two is only a bit shift.
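A small numerical example may make this concrete. The Python sketch below computes a length-16 cyclic convolution with a Fermat number transform modulo $F_3 = 257$ using $\alpha = 2$, and compares it with the direct sum. It is a sketch under assumptions of ours: the function names are hypothetical, the transform is evaluated by the direct $N^2$ sum rather than a fast algorithm, the inverse transform is taken to be $x(n) = N^{-1} \sum_k \alpha^{-nk} X(k)$, and the modular inverses use the three-argument `pow` with a negative exponent, which needs Python 3.8 or later.

```python
def ntt(x, alpha, M):
    """X(k) = sum_n alpha^(n k) x(n) mod M, by the direct O(N^2) sum."""
    N = len(x)
    return [sum(pow(alpha, n * k, M) * x[n] for n in range(N)) % M
            for k in range(N)]


def intt(X, alpha, M):
    """Inverse transform: x(n) = N^(-1) sum_k alpha^(-n k) X(k) mod M."""
    N = len(X)
    n_inv = pow(N, -1, M)            # requires that N^(-1) exists mod M
    alpha_inv = pow(alpha, -1, M)
    return [(n_inv * sum(pow(alpha_inv, n * k, M) * X[k] for k in range(N))) % M
            for n in range(N)]


def cyclic_convolution_mod(x, h, alpha, M):
    """Cyclic convolution modulo M by transform, multiply, inverse transform."""
    X, H = ntt(x, alpha, M), ntt(h, alpha, M)
    return intt([(u * v) % M for u, v in zip(X, H)], alpha, M)


M, alpha, N = 2**8 + 1, 2, 16        # F_3 = 257; 2 has order 16 modulo 257
x = [1, 2, 3, 4] + [0] * 12          # hypothetical test data
h = [1, 1, 1] + [0] * 13
y = cyclic_convolution_mod(x, h, alpha, M)
direct = [sum(x[m] * h[(n - m) % N] for m in range(N)) % M for n in range(N)]
print(y == direct, y)
```

The result is exact as long as the true convolution values are interpreted modulo M, which is why M must be chosen to cover the dynamic range of the output.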

The following table gives possible parameters for various Fermat 
number moduli. 



t    b     M = F_t      N_2    N_√2    N_max    α for N_max
3    8     2^8 + 1      16     32      256      3
4    16    2^16 + 1     32     64      65536    3
5    32    2^32 + 1     64     128     128      √2
6    64    2^64 + 1     128    256     256      √2

Table 13.1



This table gives values of N for the two most important values
of $\alpha$, which are 2 and $\sqrt{2}$. The second column gives the
approximate number of bits in the number representation. The third
column gives the Fermat number modulus, the fourth is the maximum
convolution length for $\alpha = 2$, the fifth is the maximum length for
$\alpha = \sqrt{2}$, the sixth is the maximum length for any $\alpha$, and the
seventh is the $\alpha$ for that maximum length. Remember that the first
two rows have a Fermat number modulus which is prime and the
second two rows have a composite Fermat number as modulus. Note
the differences.

The books, articles, and presentations that discuss NTT and re- 
lated topics are [209], [237], [265], [31], [253], [257], [288], [312], 
[311], [1], [55], [2], [4]. A recent book discusses NT in a signal 
processing context [215]. 



Chapter 14 

Comments: Fast Fourier 
Transforms 1 

14.1 Other Work and Results

This section comes from a note describing results on efficient algo- 
rithms to calculate the discrete Fourier transform (DFT) that were 
collected over years. Perhaps the most interesting is the discov- 
ery that the Cooley-Tukey FFT was described by Gauss in 1805 
[175]. That gives some indication of the age of research on the 
topic, and the fact that a 1995 compiled bibliography [363] on ef- 
ficient algorithms contains over 3400 entries indicates its volume. 
Three IEEE Press reprint books contain papers on the FFT [303], 
[84], [85]. An excellent general purpose FFT program has been 
described in [132], [129] and is used in Matlab and available over 
the internet. 

In addition to this book there are several others [238], [266], [25], 
[170], [383], [254], [33], [37], [345] that give a good modern the- 
oretical background for the FFT, one book [67] that gives the basic 
theory plus both FORTRAN and TMS 320 assembly language
programs, and other books [219], [348], [70] that contain chapters on
advanced FFT topics. A good up-to-date, on-line reference with 
both theory and programming techniques is in [11]. The history 
of the FFT is outlined in [87], [175] and excellent survey articles 
can be found in [115], [93]. The foundation of much of the mod- 
ern work on efficient algorithms was done by S. Winograd. These 
results can be found in [412], [415], [418]. An outline and discus- 
sion of his theorems can be found in [219] as well as [238], [266], 
[25], [170]. 

Efficient FFT algorithms for length-$2^M$ were described by Gauss
and discovered in modern times by Cooley and Tukey [91]. These
have been highly developed and good examples of FORTRAN pro- 
grams can be found in [67]. Several new algorithms have been 
published that require the least known amount of total arithmetic 
[423], [108], [104], [229], [394], [71]. Of these, the split-radix 
FFT [108], [104], [392], [366] seems to have the best structure for 
programming, and an efficient program has been written [351] to 
implement it. A mixture of decimation-in-time and
decimation-in-frequency with very good efficiency is given in [323], [324] and
in one called the Sine-Cosine FT [71]. Recently a modification to the
split-radix algorithm has been described [203] that has a slightly 
better total arithmetic count. Theoretical bounds on the number of 
multiplications required for the FFT based on Winograd's theories 
are given in [170], [172]. Schemes for calculating an in-place, in- 
order radix-2 FFT are given in [17], [19], [196], [379]. Discussion 
of various forms of unscramblers is given in [51], [321], [186], 
[123], [318], [400], [424], [370], [315]. A discussion of the rela- 
tion of the computer architecture, algorithm and compiler can be 
found in [251], [242]. A modification to allow lengths of $N = q 2^m$
for q odd is given in [24].

The "other" FFT is the prime factor algorithm (PFA) which uses an 
index map originally developed by Thomas and by Good. The
theory of the PFA was derived in [214] and further developed and an
efficient in-order and in-place program given in [58], [67]. More
results on the PFA are given in [377], [378], [379], [380], [364]. A 
method has been developed to use dynamic programming to design 
optimal FFT programs that minimize the number of additions and 
data transfers as well as multiplications [191]. This new approach 
designs custom algorithms for a particular computer architecture. 
An efficient and practical development of Winograd's ideas has 
given a design method that does not require the rather difficult Chi- 
nese remainder theorem [219], [199] for short prime length FFT's. 
These ideas have been used to design modules of length 11, 13, 17, 
19, and 25 [189]. Other methods for designing short DFT's can be 
found in [376], [223]. A use of these ideas with distributed arith- 
metic and table look-up rather than multiplication is given in [80]. 
A program that implements the nested Winograd Fourier transform 
algorithm (WFTA) is given in [238] but it has not proven as fast or 
as versatile as the PFA [58]. An interesting use of the PFA was 
announced [75] in searching for large prime numbers. 

These efficient algorithms can not only be used on DFT's but on 
other transforms with a similar structure. They have been applied 
to the discrete Hartley transform [354], [36] and the discrete cosine 
transform [394], [401], [314]. 

The fast Hartley transform has been proposed as a superior method 
for real data analysis but that has been shown not to be the case. 
A well-designed real-data FFT [360] is always as good as or better 
than a well-designed Hartley transform [354], [113], [289], [386], 
[371]. The Bruun algorithm [41], [369] also looks promising for 
real data applications as does the Rader-Brenner algorithm [310], 
[76], [386]. A novel approach to calculating the inverse DFT is 
given in [109]. 

General length algorithms include [340], [143], [125]. For lengths
that are not highly composite or prime, the chirp z-transform is
a good candidate [67], [307] for longer lengths and an efficient
order-$N^2$ algorithm called the QFT [343], [157], [160] for shorter
lengths. A method which automatically generates near-optimal
prime length Winograd based programs has been given in [199],
[330], [332], [334], [336]. This gives the same efficiency for
shorter lengths (i.e. N < 19) and new, well-structured algorithms
for much longer lengths. Another approach is
given in [285]. Special methods are available for very long lengths
[183], [365]. A very interesting general length FFT system called 
the FFTW has been developed by Frigo and Johnson at MIT. It 
uses a library of efficient "codelets" which are composed for a very 
efficient calculation of the DFT on a wide variety of computers 
[132], [129], [136]. For most lengths and on most computers, this 
is the fastest FFT today. Surprisingly, it uses a recursive program 
structure. The FFTW won the 1999 Wilkinson Prize for Numerical 
Software. 

The use of the FFT to calculate discrete convolution was one of 
its earliest uses. Although the more direct rectangular transform 
[9] would seem to be more efficient, use of the FFT or PFA is 
still probably the fastest method on a general purpose computer or 
DSP chip [287], [360], [113], [241]. On special purpose hardware 
or special architectures, the use of distributed arithmetic [80] or 
number theoretic transforms [5] may be even faster. Special al- 
gorithms for use with the short-time Fourier transform [346] and 
for the calculation of a few DFT values [349], [316], [347] and for 
recursive implementation [399], [129] have also been developed. 
An excellent analysis of efficient programming of the FFT on DSP
microprocessors is given in [243], [242]. Formulations of the DFT
in terms of tensor or Kronecker products look promising for de- 
veloping algorithms for parallel and vector computer architectures 
[361], [383], [200], [390], [385], [154], [153]. 

Various approaches to calculating approximate DFTs have been
based on CORDIC methods, short word lengths, or some form of
pruning. A new method that uses the characteristics of the signals
being transformed has combined the discrete wavelet transform
(DWT) with the DFT to give an approximate FFT with
O(N) multiplications [162], [164], [69] for certain signal classes.
A similar approach has been developed using filter banks [339],
[185].

The study of efficient algorithms not only has a long history and 
large bibliography, it is still an exciting research field where new 
results are used in practical applications. 

More information can be found on the Rice DSP Group's web
page (http://www-dsp.rice.edu).



Chapter 15

Conclusions: Fast Fourier 
Transforms 1 



This book has developed a class of efficient algorithms based on 
index mapping and polynomial algebra. This provides a frame- 
work from which the Cooley-Tukey FFT, the split-radix FFT, the 
PFA, and WFTA can be derived. Even the programs implementing 
these algorithms can have a similar structure. Winograd's theorems 
were presented and shown to be very powerful in both deriving al- 
gorithms and in evaluating them. The simple radix-2 FFT provides 
a compact, elegant means for efficiently calculating the DFT. If 
some elaboration is allowed, significant improvement can be had 
from the split-radix FFT, the radix-4 FFT or the PFA. If multipli- 
cations are expensive, the WFTA requires the least of all. 

Several methods for transforming real data were described that are
more efficient than directly using a complex FFT. A complex FFT
can be used for real data by artificially creating a complex input
from two sections of real input. An alternative and slightly more
efficient method is to construct a special FFT that utilizes the
symmetries at each stage.

As computers move to multiprocessors and multicore, writing and
maintaining efficient programs becomes more and more difficult.
The highly structured form of FFTs allows automatic generation of
very efficient programs that are tailored specifically to a particular
DSP or computer architecture.

For high-speed convolution, the traditional use of the FFT or PFA 
with blocking is probably the fastest method although rectangular 
transforms, distributed arithmetic, or number theoretic transforms 
may have a future with special VLSI hardware. 

The ideas presented in these notes can also be applied to the cal- 
culation of the discrete Hartley transform [355], [112], the discrete 
cosine transform [119], [395], and to number theoretic transforms 
[32], [239], [267]. 

There are many areas for future research. The relationship of 
hardware to algorithms, the proper use of multiple processors, 
the proper design and use of array processors and vector proces- 
sors are all open. There are still many unanswered questions in 
multi-dimensional algorithms where a simple extension of one- 
dimensional methods will not suffice. 



Appendix 1: FFT 
Flowgraphs 



16.1 Signal Flow Graphs of Cooley-Tukey 
FFTs 

The following four figures are flow graphs for Radix-2
Cooley-Tukey FFTs. The first is a length-16, decimation-in-frequency
Radix-2 FFT with the input data in order and output data
scrambled. The first stage has 8 length-2 "butterflies" (which overlap in
the figure) followed by 8 multiplications by powers of W which
are called "twiddle factors". The second stage has 2 length-8 FFTs
which are each calculated by 4 butterflies followed by 4 multiplies.
The third stage has 4 length-4 FFTs, each calculated by 2 butterflies
followed by 2 multiplies and the last stage is simply 8 butterflies
followed by trivial multiplies by one. This flow graph should be
compared with the index map in Polynomial Description of Signals
(Chapter 4), the polynomial decomposition in The DFT as
Convolution or Filtering (Chapter 5), and the program in Appendix 3. In
the program, the butterflies and twiddle factor multiplications are
done together in the innermost loop. The outermost loop indexes
through the stages. If the length of the FFT is a power of two, the
number of stages is that power ($\log_2 N$).

The second figure below is a length-16, decimation-in-time FFT
with the input data scrambled and output data in order. The first
stage has 8 length-2 "butterflies" followed by 8 twiddle factor
multiplications. The second stage has 4 length-4 FFTs which are
each calculated by 2 butterflies followed by 2 multiplies. The third
stage has 2 length-8 FFTs, each calculated by 4 butterflies followed
by 8 multiplies and the last stage is simply 8 length-2 butterflies.
This flow graph should be compared with the index map in
Polynomial Description of Signals (Chapter 4), the polynomial
decomposition in The DFT as Convolution or Filtering (Chapter 5), and
the program in Appendix 3 (Chapter 18). Here, the FFT must be
preceded by a scrambler.
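The three-loop structure of the decimation-in-frequency flow graph described above can be sketched in Python as follows. This is an illustrative sketch and not a translation of the program in Appendix 3: the function names are ours, the loop ordering (stages, then twiddle index, then butterflies) is one common arrangement, and the bit-reversal unscrambler is written for clarity rather than speed. The result is checked against a direct evaluation of the DFT.

```python
import cmath


def fft_dif_radix2(x):
    """In-place radix-2 decimation-in-frequency FFT.
    Input in natural order; output left in bit-reversed order."""
    n = len(x)
    stages = n.bit_length() - 1
    assert n == 1 << stages, "length must be a power of two"
    span = n
    for _ in range(stages):                   # outer loop: stages
        half = span // 2
        for j in range(half):                 # middle loop: twiddle index
            w = cmath.exp(-2j * cmath.pi * j / span)
            for top in range(j, n, span):     # inner loop: butterflies
                bottom = top + half
                diff = x[top] - x[bottom]
                x[top] = x[top] + x[bottom]
                x[bottom] = diff * w          # twiddle follows the butterfly
        span = half
    return x


def bit_reverse_permute(x):
    """Unscrambler: move the element at bit-reversed index i to index i."""
    n = len(x)
    bits = n.bit_length() - 1
    out = list(x)
    for i in range(n):
        r = int(format(i, "0{}b".format(bits))[::-1], 2) if bits else 0
        out[r] = x[i]
    return out


def dft_direct(x):
    """Reference: direct O(N^2) evaluation of the DFT."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


data = [complex(m, 0.5 * m * m) for m in range(16)]   # hypothetical test data
result = bit_reverse_permute(fft_dif_radix2(list(data)))
error = max(abs(u - v) for u, v in zip(result, dft_direct(data)))
print(error)   # small, limited by floating-point rounding
```

For length 16 the outer loop runs four times, matching the four stages of the flow graph, and the final stage uses only the trivial twiddle factor of one.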

The third and fourth figures below are a length-16
decimation-in-frequency and a decimation-in-time but, in contrast to the figures
above, the DIF has the output in order which requires a scrambled
input and the DIT has the input in order which requires the output
be unscrambled. Compare with the first two figures. Note the order
of the twiddle factors. The number of additions and multiplications
in all four flow graphs is the same and the structure of the
three-loop program which executes the flow graph is the same.

Figure 16.1: Length-16, Decimation-in-Frequency, In-order
input, Radix-2 FFT

Figure 16.2: Length-16, Decimation-in-Time, In-order
output, Radix-2 FFT

Figure 16.3: Length-16, alternate Decimation-in-Frequency,
In-order output, Radix-2 FFT

Figure 16.4: Length-16, alternate Decimation-in-Time,
In-order input, Radix-2 FFT



The following is a length-16, decimation-in-frequency Radix-4 
FFT with the input data in order and output data scrambled. There 
are two stages with the first stage having 4 length-4 "butterflies" 
followed by 12 multiplications by powers of W which are called
"twiddle factors". The second stage has 4 length-4 FFTs which are
each calculated by 4 butterflies followed by 4 multiplies. Note,
each stage here looks like two stages but it is one and there is only
one place where twiddle factor multiplications appear. This flow
graph should be compared with the index map in Polynomial
Description of Signals (Chapter 4), the polynomial decomposition in
The DFT as Convolution or Filtering (Chapter 5), and the program
in Appendix 3 (Chapter 18). Log to the base 4 of 16 is 2. The
total number of twiddle factor multiplications here is 12 compared
to 24 for the radix-2. The unscrambler is a base-four reverse order
counter rather than a bit reverse counter; however, a modification
of the radix-four butterflies will allow a bit reverse counter to be
used with the radix-4 FFT as with the radix-2.




Figure 16.5: Length-16, Decimation-in-Frequency, In-order 
input, Radix-4 FFT 



The following two flowgraphs are length-16,
decimation-in-frequency Split Radix FFTs with the input data in order and output
data scrambled. Because the "butterflies" are L-shaped, the stages
do not progress uniformly like the Radix-2 or 4. These two
figures are the same with the first drawn in a way to compare with
the Radix-2 and 4, and the second to illustrate the L-shaped
butterflies. These flow graphs should be compared with the index map
in Polynomial Description of Signals (Chapter 4) and the program
in Appendix 3 (Chapter 18). Because of the non-uniform stages,
the program indexing is more complicated. Although the number
of twiddle factor multiplications is 12, as in the radix-4 case,
for longer lengths the split-radix has slightly fewer multiplications
than the radix-4.

Because the structures of the radix-2, radix-4, and split-radix FFTs
are the same, the number of data additions is the same for all of them.
However, because each complex twiddle factor multiplication requires
two real additions (and four real multiplications), the total number of
additions will be fewer for the structures with fewer multiplications.




Figure 16.6: Length-16, Decimation-in-Frequency, In-order
input, Split-Radix FFT

Figure 16.7: Length-16, Decimation-in-Frequency,
Split-Radix with special BFs FFT



Appendix 2: Operation Counts for
General Length FFT

17.1 Figures 

The Glassman-Ferguson FFT is a compact implementation of a
mixed-radix Cooley-Tukey FFT with the short DFTs for each
factor being calculated by a Goertzel-like algorithm. This means there
are twiddle factor multiplications even when the factors are
relatively prime; however, the indexing is simple and compact. It
will calculate the DFT of a sequence of any length but is efficient
only if the length is highly composite. The figures contain plots of
the number of floating point multiplications plus additions vs. the
length of the FFT. The numbers on the vertical axis have relative
meaning but no absolute meaning.



This content is available online at <http://cnx.org/content/m16353/1.8/>.

Figure 17.1: Flop-Count vs Length for the Glassman-
Ferguson FFT



Note the parabolic shape of the curve for certain values. The upper
curve is for prime lengths, the next one is for lengths that are two
times a prime, the next one is for lengths that are three times
a prime, etc. The shape of the lower boundary is roughly N log N.
The program that generated these two figures used a Cooley-Tukey
FFT if the length is a power of two, which accounts for the points
that are below the major lower boundary.
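The banded structure of these curves can be reproduced with a very rough cost model. The Python sketch below is an illustration only, not the program that generated the figures: it assumes that a mixed-radix stage for a factor p costs on the order of N*p operations, so the total is N times the sum of the prime factors of N. The function names are hypothetical.

```python
def prime_factors(n):
    """Return the prime factors of n with multiplicity, smallest first."""
    factors = []
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


def rough_cost(n):
    """Assumed model: every stage with factor p costs about n*p operations."""
    return n * sum(prime_factors(n))
```

Under this assumed model a prime length costs N^2, a length of two times a prime costs about N^2/2, and a power of two costs 2 N log2 N, which is the ordering of the curves described above.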



Figure 17.2: Flop-Count vs Length for the Glassman- 
Ferguson FFT 
Appendix 3: FFT Computer
Programs 1

18.1 Goertzel Algorithm 

A FORTRAN implementation of the first-order Goertzel algorithm 
with in-order input as given in () and [68] is given below. 



This content is available online at <http://cnx.org/content/m17397/1.5/>.
C
C   GOERTZEL'S DFT ALGORITHM
C   First order, input inorder
C   C. S. BURRUS, SEPT 1983
C
      SUBROUTINE DFT(X,Y,A,B,N)
      REAL X(260), Y(260), A(260), B(260)
      Q = 6.283185307179586/N
      DO 20 J=1, N
         C = COS(Q*(J-1))
         S = SIN(Q*(J-1))
         AT = X(1)
         BT = Y(1)
         DO 30 I = 2, N
            T  = C*AT - S*BT + X(I)
            BT = C*BT + S*AT + Y(I)
            AT = T
   30    CONTINUE
         A(J) = C*AT - S*BT
         B(J) = C*BT + S*AT
   20 CONTINUE
      RETURN
      END

Listing 18.1: First Order Goertzel Algorithm 
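The recursion in Listing 18.1 is Horner's rule applied to the DFT polynomial. As a cross-check, here is a minimal Python sketch of the same idea using complex arithmetic and 0-based indices; the function name `goertzel_dft` is hypothetical and the code is not a line-by-line translation of the Fortran.

```python
import cmath
import math


def goertzel_dft(x):
    """DFT of a non-empty sequence by the first-order Goertzel recursion."""
    n = len(x)
    out = []
    for k in range(n):
        # w = exp(+j*2*pi*k/n), the same (C, S) pair as in the listing
        w = cmath.exp(2j * math.pi * k / n)
        acc = x[0]
        for sample in x[1:]:
            acc = w * acc + sample      # Horner step
        out.append(w * acc)             # final multiply, as in A(J), B(J)
    return out
```

Because w**n equals 1, the final product is the sum of x[m]*w**(-m), which is the k-th DFT value. Each output costs about N complex multiplications, so the full transform is an order N^2 computation.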



18.2 Second Order Goertzel Algorithm 

Below is the program for a second order Goertzel algorithm. 
C
C   GOERTZEL'S DFT ALGORITHM
C   Second order, input inorder
C   C. S. BURRUS, SEPT 1983
C
      SUBROUTINE DFT(X,Y,A,B,N)
      REAL X(260), Y(260), A(260), B(260)
C
      Q = 6.283185307179586/N
      DO 20 J = 1, N
         C  = COS(Q*(J-1))
         S  = SIN(Q*(J-1))
         CC = 2*C
         A2 = 0
         B2 = 0
         A1 = X(1)
         B1 = Y(1)
         DO 30 I = 2, N
            T  = A1
            A1 = CC*A1 - A2 + X(I)
            A2 = T
            T  = B1
            B1 = CC*B1 - B2 + Y(I)
            B2 = T
   30    CONTINUE
         A(J) = C*A1 - A2 - S*B1
         B(J) = C*B1 - B2 + S*A1
   20 CONTINUE
C
      RETURN
      END

Listing 18.2: Second Order Goertzel Algorithm 
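The second-order form replaces the complex multiplication in the inner loop by a multiplication with the real number 2*cos(theta). A minimal Python sketch of that recursion follows; the name `goertzel2_dft` is hypothetical, indices are 0-based, and complex numbers stand in for the separate real and imaginary arrays of the listing.

```python
import math


def goertzel2_dft(x):
    """DFT of a non-empty sequence by the second-order Goertzel recursion."""
    n = len(x)
    out = []
    for k in range(n):
        theta = 2.0 * math.pi * k / n
        c, s = math.cos(theta), math.sin(theta)
        v1, v2 = x[0], 0            # v1 = newest state, v2 = previous state
        for sample in x[1:]:
            v1, v2 = 2.0 * c * v1 - v2 + sample, v1
        # one complex multiply per output: exp(j*theta)*v1 - v2
        out.append(complex(c, s) * v1 - v2)
    return out
```

Only the final line needs a complex multiplication, which is why this form roughly halves the multiplication count of the first-order version.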
18.3 Second Order Goertzel Algorithm 2

Below is a second order Goertzel algorithm that calculates two
outputs at a time.

C
C   GOERTZEL'S DFT ALGORITHM, Second order
C   Input inorder, output by twos; C.S. Burrus, SEPT 1991
C
      SUBROUTINE DFT(X,Y,A,B,N)
      REAL X(260), Y(260), A(260), B(260)
      Q = 6.283185307179586/N
      DO 20 J = 1, N/2 + 1
         C  = COS(Q*(J-1))
         S  = SIN(Q*(J-1))
         CC = 2*C
         A2 = 0
         B2 = 0
         A1 = X(1)
         B1 = Y(1)
         DO 30 I = 2, N
            T  = A1
            A1 = CC*A1 - A2 + X(I)
            A2 = T
            T  = B1
            B1 = CC*B1 - B2 + Y(I)
            B2 = T
   30    CONTINUE
         A2 = C*A1 - A2
         T  = S*B1
         A(J)     = A2 - T
         A(N-J+2) = A2 + T
         B2 = C*B1 - B2
         T  = S*A1
         B(J)     = B2 + T
         B(N-J+2) = B2 - T
   20 CONTINUE
      RETURN
      END

Figure. Second Order Goertzel Calculating Two Outputs at a Time
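The two-at-a-time version works because outputs k and N-k share the same cos(theta) and therefore the same recursion; only the sign of sin(theta) differs in the final step. A minimal Python sketch under that reading is given below. The name `goertzel2_pairs` is hypothetical, and the index N-k is reduced modulo N here so that k = 0 stays in range, which is a choice of this sketch rather than of the listing.

```python
import math


def goertzel2_pairs(x):
    """DFT by second-order Goertzel, producing outputs k and n-k together."""
    n = len(x)
    out = [0j] * n
    for k in range(n // 2 + 1):
        theta = 2.0 * math.pi * k / n
        c, s = math.cos(theta), math.sin(theta)
        v1, v2 = x[0], 0
        for sample in x[1:]:
            v1, v2 = 2.0 * c * v1 - v2 + sample, v1
        base = c * v1 - v2          # part shared by both outputs
        cross = 1j * s * v1         # part whose sign flips for n-k
        out[k] = base + cross
        out[(n - k) % n] = base - cross
    return out
```

The inner loop now runs for only about half of the output indices, so the work in the recursion is roughly halved.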

18.4 Basic QFT Algorithm 

A FORTRAN implementation of the basic QFT algorithm is given 
below to show how the theory is implemented. The program is 
written for clarity, not to minimize the number of floating point 
operations. 
C
      SUBROUTINE QDFT(X,Y,XX,YY,NN)
      REAL X(0:260), Y(0:260), XX(0:260), YY(0:260)
C
      N1  = NN - 1
      N2  = N1/2
      N21 = NN/2
      Q   = 6.283185308/NN
      DO 2 K = 0, N21
         SSX = X(0)
         SSY = Y(0)
         SDX = 0
         SDY = 0
         IF (MOD(NN,2).EQ.0) THEN
            SSX = SSX + COS(3.141592654*K)*X(N21)
            SSY = SSY + COS(3.141592654*K)*Y(N21)
         ENDIF
         DO 3 N = 1, N2
            SSX = SSX + (X(N) + X(NN-N))*COS(Q*N*K)
            SSY = SSY + (Y(N) + Y(NN-N))*COS(Q*N*K)
            SDX = SDX + (X(N) - X(NN-N))*SIN(Q*N*K)
            SDY = SDY + (Y(N) - Y(NN-N))*SIN(Q*N*K)
    3    CONTINUE
         XX(K)    = SSX + SDY
         YY(K)    = SSY - SDX
         XX(NN-K) = SSX - SDY
         YY(NN-K) = SSY + SDX
    2 CONTINUE
      RETURN
      END



Listing 18.3: Simple QFT Fortran Program 
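The QFT listing exploits the even symmetry of the cosine and the odd symmetry of the sine: samples n and N-n are added before multiplying by the cosine and subtracted before multiplying by the sine. The following Python sketch shows the same structure with complex data and 0-based indices. The name `qft` is hypothetical, and, like the listing, the code favors clarity over a minimal operation count.

```python
import math


def qft(x):
    """DFT of a non-empty sequence using cosine/sine symmetries."""
    n = len(x)
    half = (n - 1) // 2
    out = [0j] * n
    for k in range(n // 2 + 1):
        even_sum = x[0]                 # terms multiplied by cosines
        odd_sum = 0j                    # terms multiplied by sines
        if n % 2 == 0:
            even_sum += math.cos(math.pi * k) * x[n // 2]
        for m in range(1, half + 1):
            ang = 2.0 * math.pi * m * k / n
            even_sum += (x[m] + x[n - m]) * math.cos(ang)
            odd_sum += (x[m] - x[n - m]) * math.sin(ang)
        out[k] = even_sum - 1j * odd_sum
        out[(n - k) % n] = even_sum + 1j * odd_sum
    return out
```

Each pass through the outer loop yields two outputs, and each inner step uses one cosine and one sine product per pair of inputs, which is where the savings over the direct DFT come from.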
18.5 Basic Radix-2 FFT Algorithm

Below is the Fortran code for a simple Decimation-in-Frequency, 
Radix-2, one butterfly Cooley-Tukey FFT followed by a bit- 
reversing unscrambler. 

C
C   A COOLEY-TUKEY RADIX-2, DIF FFT PROGRAM
C   COMPLEX INPUT DATA IN ARRAYS X AND Y
C   C. S. BURRUS, RICE UNIVERSITY, SEPT 1983
C
      SUBROUTINE FFT (X,Y,N,M)
      REAL X(1), Y(1)
C--------------MAIN FFT LOOPS-----------------------------
C
      N2 = N
      DO 10 K = 1, M
         N1 = N2
         N2 = N2/2
         E  = 6.283185307179586/N1
         A  = 0
         DO 20 J = 1, N2
            C = COS (A)
            S = SIN (A)
            A = J*E
            DO 30 I = J, N, N1
               L = I + N2
               XT   = X(I) - X(L)
               X(I) = X(I) + X(L)
               YT   = Y(I) - Y(L)
               Y(I) = Y(I) + Y(L)
               X(L) = C*XT + S*YT
               Y(L) = C*YT - S*XT
   30       CONTINUE
   20    CONTINUE
   10 CONTINUE
C
C------------DIGIT REVERSE COUNTER------------------------
  100 J = 1
      N1 = N - 1
      DO 104 I=1, N1
         IF (I.GE.J) GOTO 101
         XT   = X(J)
         X(J) = X(I)
         X(I) = XT
         XT   = Y(J)
         Y(J) = Y(I)
         Y(I) = XT
  101    K = N/2
  102    IF (K.GE.J) GOTO 103
            J = J - K
            K = K/2
            GOTO 102
  103    J = J + K
  104 CONTINUE
      RETURN
      END



Figure: Radix-2, DIF, One Butterfly Cooley-Tukey FFT 
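For readers who want to experiment with the structure of this program, here is a minimal Python sketch of the same three nested loops followed by the bit-reversal counter. It uses complex numbers and 0-based indices, so the counter test becomes `k <= j` instead of `K < J`; the name `fft_radix2_dif` is hypothetical.

```python
import cmath
import math


def fft_radix2_dif(x):
    """Radix-2 decimation-in-frequency FFT; len(x) must be a power of two."""
    a = list(x)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    span = n                                   # N1 in the listing
    while span > 1:
        half = span // 2                       # N2 in the listing
        for j in range(half):
            w = cmath.exp(-2j * math.pi * j / span)
            for i in range(j, n, span):
                l = i + half
                diff = a[i] - a[l]
                a[i] = a[i] + a[l]
                a[l] = diff * w                # twiddle after the subtraction
        span = half
    # bit-reversal counter (0-based form of statements 100 to 104)
    j = 0
    for i in range(n - 1):
        if i < j:
            a[i], a[j] = a[j], a[i]
        k = n // 2
        while k <= j:
            j -= k
            k //= 2
        j += k
    return a
```

The twiddle factor is applied after the butterfly subtraction, which is what makes this a decimation-in-frequency form and leaves the output in bit-reversed order before the final loop.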



18.6 Basic DIT Radix-2 FFT Algorithm 

Below is the Fortran code for a simple Decimation-in-Time, Radix-
2, one butterfly Cooley-Tukey FFT preceded by a bit-reversing
scrambler.

C
C   A COOLEY-TUKEY RADIX-2, DIT FFT PROGRAM
C   COMPLEX INPUT DATA IN ARRAYS X AND Y
C   C. S. BURRUS, RICE UNIVERSITY, SEPT 1985
C
      SUBROUTINE FFT (X,Y,N,M)
      REAL X(1), Y(1)
C------------DIGIT REVERSE COUNTER------------------------
C
  100 J = 1
      N1 = N - 1
      DO 104 I=1, N1
         IF (I.GE.J) GOTO 101
         XT   = X(J)
         X(J) = X(I)
         X(I) = XT
         XT   = Y(J)
         Y(J) = Y(I)
         Y(I) = XT
  101    K = N/2
  102    IF (K.GE.J) GOTO 103
            J = J - K
            K = K/2
            GOTO 102
  103    J = J + K
  104 CONTINUE
C--------------MAIN FFT LOOPS-----------------------------
C
      N2 = 1
      DO 10 K = 1, M
         E = 6.283185307179586/(2*N2)
         A = 0
         DO 20 J = 1, N2
            C = COS (A)
            S = SIN (A)
            A = J*E
            DO 30 I = J, N, 2*N2
               L = I + N2
               XT   = C*X(L) + S*Y(L)
               YT   = C*Y(L) - S*X(L)
               X(L) = X(I) - XT
               X(I) = X(I) + XT
               Y(L) = Y(I) - YT
               Y(I) = Y(I) + YT
   30       CONTINUE
   20    CONTINUE
         N2 = N2+N2
   10 CONTINUE
C
      RETURN
      END





18.7 DIF Radix-2 FFT Algorithm 

Below is the Fortran code for a Decimation-in-Frequency, Radix- 
2, three butterfly Cooley-Tukey FFT followed by a bit-reversing 
unscrambler. 

C   A COOLEY-TUKEY RADIX 2, DIF FFT PROGRAM
C   THREE-BF, MULT BY 1 AND J ARE REMOVED
C   COMPLEX INPUT DATA IN ARRAYS X AND Y
C   TABLE LOOK-UP OF W VALUES
C   C. S. BURRUS, RICE UNIVERSITY, SEPT 1983
C
      SUBROUTINE FFT (X,Y,N,M,WR,WI)
      REAL X(1), Y(1), WR(1), WI(1)
C--------------MAIN FFT LOOPS-----------------------------
C
      N2 = N
      DO 10 K = 1, M
         N1 = N2
         N2 = N2/2
         JT = N2/2 + 1
         DO 1 I = 1, N, N1
            L = I + N2
            T    = X(I) - X(L)
            X(I) = X(I) + X(L)
            X(L) = T
            T    = Y(I) - Y(L)
            Y(I) = Y(I) + Y(L)
            Y(L) = T
    1    CONTINUE
         IF (K.EQ.M) GOTO 10
         IE = N/N1
         IA = 1
         DO 20 J = 2, N2
            IA = IA + IE
            IF (J.EQ.JT) GOTO 50
            C = WR(IA)
            S = WI(IA)
            DO 30 I = J, N, N1
               L = I + N2
               T    = X(I) - X(L)
               X(I) = X(I) + X(L)
               TY   = Y(I) - Y(L)
               Y(I) = Y(I) + Y(L)
               X(L) = C*T  + S*TY
               Y(L) = C*TY - S*T
   30       CONTINUE
            GOTO 25
   50       DO 40 I = J, N, N1
               L = I + N2
               T    = X(I) - X(L)
               X(I) = X(I) + X(L)
               TY   = Y(I) - Y(L)
               Y(I) = Y(I) + Y(L)
               X(L) = TY
               Y(L) =-T
   40       CONTINUE
   25       A = J*E
   20    CONTINUE
   10 CONTINUE
C------------DIGIT REVERSE COUNTER GOES HERE--------------
      RETURN
      END
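The listing above takes the twiddle factor tables WR and WI as arguments but does not show how they are filled. Judging from the index arithmetic (IA starts at 1 and advances by IE = N/N1) and from the way C and S are used, a table of cosines and sines at multiples of 2*pi/N appears to be what is expected. The Python sketch below builds such a table under that assumption; the function names are hypothetical and the indices are 0-based.

```python
import math


def twiddle_tables(n):
    """Assumed layout: wr[i] = cos(2*pi*i/n), wi[i] = sin(2*pi*i/n)."""
    wr = [math.cos(2.0 * math.pi * i / n) for i in range(n)]
    wi = [math.sin(2.0 * math.pi * i / n) for i in range(n)]
    return wr, wi


def table_index(j, n, n1):
    """0-based table index for butterfly j (1-based) in a stage of span n1."""
    return (j - 1) * (n // n1)
```

With this layout, butterfly J of a stage with span N1 picks up the angle 2*pi*(J-1)/N1, which is the same angle that the table-free programs compute with COS and SIN inside the loop.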



18.8 Basic DIF Radix-4 FFT Algorithm 

Below is the Fortran code for a simple Decimation-in-Frequency,
Radix-4, one butterfly Cooley-Tukey FFT to be followed by an
unscrambler.

C   A COOLEY-TUKEY RADIX-4 DIF FFT PROGRAM
C   COMPLEX INPUT DATA IN ARRAYS X AND Y
C   LENGTH IS N = 4 ** M
C   C. S. BURRUS, RICE UNIVERSITY, SEPT 1983
C
      SUBROUTINE FFT4 (X,Y,N,M)
      REAL X(1), Y(1)
C--------------MAIN FFT LOOPS-----------------------------
      N2 = N
      DO 10 K = 1, M
         N1 = N2
         N2 = N2/4
         E  = 6.283185307179586/N1
         A  = 0
C--------------MAIN BUTTERFLIES---------------------------
         DO 20 J=1, N2
            B   = A + A
            C   = A + B
            CO1 = COS(A)
            CO2 = COS(B)
            CO3 = COS(C)
            SI1 = SIN(A)
            SI2 = SIN(B)
            SI3 = SIN(C)
            A   = J*E
C--------------BUTTERFLIES WITH SAME W--------------------
            DO 30 I=J, N, N1
               I1 = I  + N2
               I2 = I1 + N2
               I3 = I2 + N2
               R1 = X(I ) + X(I2)
               R3 = X(I ) - X(I2)
               S1 = Y(I ) + Y(I2)
               S3 = Y(I ) - Y(I2)
               R2 = X(I1) + X(I3)
               R4 = X(I1) - X(I3)
               S2 = Y(I1) + Y(I3)
               S4 = Y(I1) - Y(I3)
               X(I) = R1 + R2
               R2   = R1 - R2
               R1   = R3 - S4
               R3   = R3 + S4
               Y(I) = S1 + S2
               S2   = S1 - S2
               S1   = S3 + R4
               S3   = S3 - R4
               X(I1) = CO1*R3 + SI1*S3
               Y(I1) = CO1*S3 - SI1*R3
               X(I2) = CO2*R2 + SI2*S2
               Y(I2) = CO2*S2 - SI2*R2
               X(I3) = CO3*R1 + SI3*S1
               Y(I3) = CO3*S1 - SI3*R1
   30       CONTINUE
   20    CONTINUE
   10 CONTINUE
C------------DIGIT REVERSE COUNTER goes here--------------
      RETURN
      END





18.9 Basic DIF Radix-4 FFT Algorithm 

Below is the Fortran code for a Decimation-in-Frequency, Radix- 
4, three butterfly Cooley-Tukey FFT followed by a bit-reversing 
unscrambler. Twiddle factors are precalculated and stored in arrays 
WR and WI. 

C
C   A COOLEY-TUKEY RADIX-4 DIF FFT PROGRAM
C   THREE BF, MULTIPLICATIONS BY 1, J, ETC. ARE REMOVED
C   COMPLEX INPUT DATA IN ARRAYS X AND Y
C   LENGTH IS N = 4 ** M
C   TABLE LOOKUP OF W VALUES
C
C   C. S. BURRUS, RICE UNIVERSITY, SEPT 1983
C
      SUBROUTINE FFT4 (X,Y,N,M,WR,WI)
      REAL X(1), Y(1), WR(1), WI(1)
      DATA C21 / 0.707106778 /
C
C--------------MAIN FFT LOOPS-----------------------------
C
      N2 = N
      DO 10 K = 1, M
         N1 = N2
         N2 = N2/4
         JT = N2/2 + 1
C--------------SPECIAL BUTTERFLY FOR W = 1----------------
         DO 1 I = 1, N, N1
            I1 = I  + N2
            I2 = I1 + N2
            I3 = I2 + N2
            R1 = X(I ) + X(I2)
            R3 = X(I ) - X(I2)
            S1 = Y(I ) + Y(I2)
            S3 = Y(I ) - Y(I2)
            R2 = X(I1) + X(I3)
            R4 = X(I1) - X(I3)
            S2 = Y(I1) + Y(I3)
            S4 = Y(I1) - Y(I3)
C
            X(I) = R1 + R2
            X(I2)= R1 - R2
            X(I3)= R3 - S4
            X(I1)= R3 + S4
C
            Y(I) = S1 + S2
            Y(I2)= S1 - S2
            Y(I3)= S3 + R4
            Y(I1)= S3 - R4
C
    1    CONTINUE
         IF (K.EQ.M) GOTO 10
         IE  = N/N1
         IA1 = 1
C--------------GENERAL BUTTERFLY--------------------------
         DO 20 J = 2, N2
            IA1 = IA1 + IE
            IF (J.EQ.JT) GOTO 50
            IA2 = IA1 + IA1 - 1
            IA3 = IA2 + IA1 - 1
            CO1 = WR(IA1)
            CO2 = WR(IA2)
            CO3 = WR(IA3)
            SI1 = WI(IA1)
            SI2 = WI(IA2)
            SI3 = WI(IA3)
C--------------BUTTERFLIES WITH SAME W--------------------
            DO 30 I = J, N, N1
               I1 = I  + N2
               I2 = I1 + N2
               I3 = I2 + N2
               R1 = X(I ) + X(I2)
               R3 = X(I ) - X(I2)
               S1 = Y(I ) + Y(I2)
               S3 = Y(I ) - Y(I2)
               R2 = X(I1) + X(I3)
               R4 = X(I1) - X(I3)
               S2 = Y(I1) + Y(I3)
               S4 = Y(I1) - Y(I3)
C
               X(I) = R1 + R2
               R2   = R1 - R2
               R1   = R3 - S4
               R3   = R3 + S4
C
               Y(I) = S1 + S2
               S2   = S1 - S2
               S1   = S3 + R4
               S3   = S3 - R4
C
               X(I1) = CO1*R3 + SI1*S3
               Y(I1) = CO1*S3 - SI1*R3
               X(I2) = CO2*R2 + SI2*S2
               Y(I2) = CO2*S2 - SI2*R2
               X(I3) = CO3*R1 + SI3*S1
               Y(I3) = CO3*S1 - SI3*R1
   30       CONTINUE
            GOTO 20
C--------------SPECIAL BUTTERFLY FOR W = J----------------
   50       DO 40 I = J, N, N1
               I1 = I  + N2
               I2 = I1 + N2
               I3 = I2 + N2
               R1 = X(I ) + X(I2)
               R3 = X(I ) - X(I2)
               S1 = Y(I ) + Y(I2)
               S3 = Y(I ) - Y(I2)
               R2 = X(I1) + X(I3)
               R4 = X(I1) - X(I3)
               S2 = Y(I1) + Y(I3)
               S4 = Y(I1) - Y(I3)
C
               X(I) = R1 + R2
               Y(I2)=-R1 + R2
               R1   = R3 - S4
               R3   = R3 + S4
C
               Y(I) = S1 + S2
               X(I2)= S1 - S2
               S1   = S3 + R4
               S3   = S3 - R4
C
               X(I1) = (S3 + R3)*C21
               Y(I1) = (S3 - R3)*C21
               X(I3) = (S1 - R1)*C21
               Y(I3) =-(S1 + R1)*C21
   40       CONTINUE
   20    CONTINUE
   10 CONTINUE
C------------DIGIT REVERSE COUNTER------------------------
  100 J = 1
      N1 = N - 1
      DO 104 I = 1, N1
         IF (I.GE.J) GOTO 101
         R1   = X(J)
         X(J) = X(I)
         X(I) = R1
         R1   = Y(J)
         Y(J) = Y(I)
         Y(I) = R1
  101    K = N/4
  102    IF (K*3.GE.J) GOTO 103
            J = J - K*3
            K = K/4
            GOTO 102
  103    J = J + K
  104 CONTINUE
      RETURN
      END
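The unscrambler at the end of this program is a base-4 version of the bit-reversal counter: a digit that has reached 3 is cleared and the carry moves to the next lower-weight position. A small Python sketch of the counter for a general radix is shown below, using 0-based indices so that the test `K*3 .GE. J` becomes `k*(radix-1) <= j` with the sense reversed. The name `digit_reverse_permutation` is hypothetical.

```python
def digit_reverse_permutation(n, radix):
    """perm[i] = i with its base-`radix` digits reversed; n = radix**m."""
    perm = []
    j = 0
    for i in range(n):
        perm.append(j)
        if i == n - 1:
            break                       # no successor after the last index
        k = n // radix
        while k * (radix - 1) <= j:     # leading digit is full: clear, carry
            j -= k * (radix - 1)
            k //= radix
        j += k
    return perm
```

For example, with n = 16 and radix 4 the index 1 maps to 4 and the index 4 maps back to 1, so the pairs can be swapped in place exactly as the Fortran loop does when I is less than J.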



18.10 Basic DIF Split Radix FFT Algorithm 

Below is the Fortran code for a simple Decimation-in-Frequency,
Split-Radix, one butterfly FFT to be followed by a bit-reversing
unscrambler.

C   A DUHAMEL-HOLLMANN SPLIT RADIX FFT PROGRAM
C   FROM: ELECTRONICS LETTERS, JAN. 5, 1984
C   COMPLEX INPUT DATA IN ARRAYS X AND Y
C   LENGTH IS N = 2 ** M
C   C. S. BURRUS, RICE UNIVERSITY, MARCH 1984
C
      SUBROUTINE FFT (X,Y,N,M)
      REAL X(1), Y(1)
C--------------MAIN FFT LOOPS-----------------------------
C
      N1 = N
      N2 = N/2
      IP = 0
      IS = 1
      A  = 6.283185307179586/N
      DO 10 K = 1, M-1
         JD = N1 + N2
         N1 = N2
         N2 = N2/2
         J0 = N1*IP + 1
         IP = 1 - IP
         DO 20 J = J0, N, JD
            JS = 0
            JT = J + N2 - 1
            DO 30 I = J, JT
               JSS= JS*IS
               JS = JS + 1
               C1 = COS(A*JSS)
               C3 = COS(3*A*JSS)
               S1 = -SIN(A*JSS)
               S3 = -SIN(3*A*JSS)
               I1 = I  + N2
               I2 = I1 + N2
               I3 = I2 + N2
               R1    = X(I ) + X(I2)
               R2    = X(I ) - X(I2)
               R3    = X(I1) - X(I3)
               X(I2) = X(I1) + X(I3)
               X(I1) = R1
C
               R1    = Y(I ) + Y(I2)
               R4    = Y(I ) - Y(I2)
               R5    = Y(I1) - Y(I3)
               Y(I2) = Y(I1) + Y(I3)
               Y(I1) = R1
C
               R1    = R2 - R5
               R2    = R2 + R5
               R5    = R4 + R3
               R4    = R4 - R3
C
               X(I)  = C1*R1 + S1*R5
               Y(I)  = C1*R5 - S1*R1
               X(I3) = C3*R2 + S3*R4
               Y(I3) = C3*R4 - S3*R2
   30       CONTINUE
   20    CONTINUE
         IS = IS + IS
   10 CONTINUE
      IP = 1 - IP
      J0 = 2 - IP
      DO 5 I = J0, N-1, 3
         I1 = I + 1
         R1    = X(I) + X(I1)
         X(I1) = X(I) - X(I1)
         X(I)  = R1
         R1    = Y(I) + Y(I1)
         Y(I1) = Y(I) - Y(I1)
         Y(I)  = R1
    5 CONTINUE
      RETURN
      END
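The index bookkeeping in the in-place program can hide the basic idea of the split-radix decomposition, which is to use a radix-2 step on the even-indexed half of the data and a radix-4 step on the odd-indexed half. The Python sketch below shows that idea in recursive, decimation-in-time form. It is not a translation of the listing (which is an in-place, decimation-in-frequency program), and the name `split_radix_fft` is hypothetical.

```python
import cmath
import math


def split_radix_fft(x):
    """Recursive split-radix FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n == 2:
        return [complex(x[0] + x[1]), complex(x[0] - x[1])]
    u = split_radix_fft(x[0::2])        # length n/2, even samples
    z1 = split_radix_fft(x[1::4])       # length n/4, samples 4m+1
    z3 = split_radix_fft(x[3::4])       # length n/4, samples 4m+3
    q = n // 4
    out = [0j] * n
    for k in range(q):
        t1 = cmath.exp(-2j * math.pi * k / n) * z1[k]
        t3 = cmath.exp(-6j * math.pi * k / n) * z3[k]
        s = t1 + t3
        d = -1j * (t1 - t3)
        out[k] = u[k] + s
        out[k + 2 * q] = u[k] - s
        out[k + q] = u[k + q] + d
        out[k + 3 * q] = u[k + q] - d
    return out
```

Each pass through the loop uses two twiddle factor multiplications to produce four outputs, which is the L-shaped butterfly that gives the split-radix algorithm its lower multiplication count.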

18.11 DIF Split Radix FFT Algorithm 

Below is the Fortran code for a simple Decimation-in-Frequency,
Split-Radix, two butterfly FFT to be followed by a bit-reversing
unscrambler. Twiddle factors are precalculated and stored in arrays
WR and WI.

C
C   A DUHAMEL-HOLLMAN SPLIT RADIX FFT
C   REF: ELECTRONICS LETTERS, JAN. 5, 1984
C   COMPLEX INPUT AND OUTPUT DATA IN ARRAYS X AND Y
C   LENGTH IS N = 2 ** M, OUTPUT IN BIT-REVERSED ORDER
C   TWO BUTTERFLIES TO REMOVE MULTS BY UNITY
C   SPECIAL LAST TWO STAGES
C   TABLE LOOK-UP OF SINE AND COSINE VALUES
C   C.S. BURRUS, RICE UNIV. APRIL 1985
C
      SUBROUTINE FFT(X,Y,N,M,WR,WI)
      REAL X(1),Y(1),WR(1),WI(1)
      C81= 0.707106778
      N2 = 2*N
      DO 10 K = 1, M-3
         IS = 1
         ID = N2
         N2 = N2/2
         N4 = N2/4
    2    DO 1 I0 = IS, N-1, ID
            I1 = I0 + N4
            I2 = I1 + N4
            I3 = I2 + N4
            R1    = X(I0) - X(I2)
            X(I0) = X(I0) + X(I2)
            R2    = Y(I1) - Y(I3)
            Y(I1) = Y(I1) + Y(I3)
            X(I2) = R1 + R2
            R2    = R1 - R2
            R1    = X(I1) - X(I3)
            X(I1) = X(I1) + X(I3)
            X(I3) = R2
            R2    = Y(I0) - Y(I2)
            Y(I0) = Y(I0) + Y(I2)
            Y(I2) =-R1 + R2
            Y(I3) = R1 + R2
    1    CONTINUE
         IS = 2*ID - N2 + 1
         ID = 4*ID
         IF (IS.LT.N) GOTO 2
         IE  = N/N2
         IA1 = 1
         DO 20 J = 2, N4
            IA1 = IA1 + IE
            IA3 = 3*IA1 - 2
            CC1 = WR(IA1)
            SS1 = WI(IA1)
            CC3 = WR(IA3)
            SS3 = WI(IA3)
            IS  = J
            ID  = 2*N2
   40       DO 30 I0 = IS, N-1, ID
               I1 = I0 + N4
               I2 = I1 + N4
               I3 = I2 + N4
C
               R1    = X(I0) - X(I2)
               X(I0) = X(I0) + X(I2)
               R2    = X(I1) - X(I3)
               X(I1) = X(I1) + X(I3)
               S1    = Y(I0) - Y(I2)
               Y(I0) = Y(I0) + Y(I2)
               S2    = Y(I1) - Y(I3)
               Y(I1) = Y(I1) + Y(I3)
C
               S3    = R1 - S2
               R1    = R1 + S2
               S2    = R2 - S1
               R2    = R2 + S1
               X(I2) = R1*CC1 - S2*SS1
               Y(I2) =-S2*CC1 - R1*SS1
               X(I3) = S3*CC3 + R2*SS3
               Y(I3) = R2*CC3 - S3*SS3
   30       CONTINUE
            IS = 2*ID - N2 + J
            ID = 4*ID
            IF (IS.LT.N) GOTO 40
   20    CONTINUE
   10 CONTINUE
C
      IS = 1
      ID = 32
   50 DO 60 I = IS, N, ID
         I0 = I + 8
         DO 15 J = 1, 2
            R1 = X(I0)   + X(I0+2)
            R3 = X(I0)   - X(I0+2)
            R2 = X(I0+1) + X(I0+3)
            R4 = X(I0+1) - X(I0+3)
            X(I0)   = R1 + R2
            X(I0+1) = R1 - R2
            R1 = Y(I0)   + Y(I0+2)
            S3 = Y(I0)   - Y(I0+2)
            R2 = Y(I0+1) + Y(I0+3)
            S4 = Y(I0+1) - Y(I0+3)
            Y(I0)   = R1 + R2
            Y(I0+1) = R1 - R2
            Y(I0+2) = S3 - R4
            Y(I0+3) = S3 + R4
            X(I0+2) = R3 + S4
            X(I0+3) = R3 - S4
            I0 = I0 + 4
   15    CONTINUE
   60 CONTINUE
      IS = 2*ID - 15
      ID = 4*ID
      IF (IS.LT.N) GOTO 50
C
      IS = 1
      ID = 16
   55 DO 65 I0 = IS, N, ID
         R1 = X(I0)   + X(I0+4)
         R5 = X(I0)   - X(I0+4)
         R2 = X(I0+1) + X(I0+5)
         R6 = X(I0+1) - X(I0+5)
         R3 = X(I0+2) + X(I0+6)
         R7 = X(I0+2) - X(I0+6)
         R4 = X(I0+3) + X(I0+7)
         R8 = X(I0+3) - X(I0+7)
         T1 = R1 - R3
         R1 = R1 + R3
         R3 = R2 - R4
         R2 = R2 + R4
         X(I0)   = R1 + R2
         X(I0+1) = R1 - R2
C
         R1 = Y(I0)   + Y(I0+4)
         S5 = Y(I0)   - Y(I0+4)
         R2 = Y(I0+1) + Y(I0+5)
         S6 = Y(I0+1) - Y(I0+5)
         S3 = Y(I0+2) + Y(I0+6)
         S7 = Y(I0+2) - Y(I0+6)
         R4 = Y(I0+3) + Y(I0+7)
         S8 = Y(I0+3) - Y(I0+7)
         T2 = R1 - S3
         R1 = R1 + S3
         S3 = R2 - R4
         R2 = R2 + R4
         Y(I0)   = R1 + R2
         Y(I0+1) = R1 - R2
         X(I0+2) = T1 + S3
         X(I0+3) = T1 - S3
         Y(I0+2) = T2 - R3
         Y(I0+3) = T2 + R3
C
         R1 = (R6 - R8)*C81
         R6 = (R6 + R8)*C81
         R2 = (S6 - S8)*C81
         S6 = (S6 + S8)*C81
C
         T1 = R5 - R1
         R5 = R5 + R1
         R8 = R7 - R6
         R7 = R7 + R6
         T2 = S5 - R2
         S5 = S5 + R2
         S8 = S7 - S6
         S7 = S7 + S6
         X(I0+4) = R5 + S7
         X(I0+7) = R5 - S7
         X(I0+5) = T1 + S8
         X(I0+6) = T1 - S8
         Y(I0+4) = S5 - R7
         Y(I0+7) = S5 + R7
         Y(I0+5) = T2 - R8
         Y(I0+6) = T2 + R8
   65 CONTINUE
      IS = 2*ID - 7
      ID = 4*ID
      IF (IS.LT.N) GOTO 55
C
C------------BIT REVERSE COUNTER--------------------------
C
  100 J = 1
      N1 = N - 1
      DO 104 I=1, N1
         IF (I.GE.J) GOTO 101
         XT   = X(J)
         X(J) = X(I)
         X(I) = XT
         XT   = Y(J)
         Y(J) = Y(I)
         Y(I) = XT
  101    K = N/2
  102    IF (K.GE.J) GOTO 103
            J = J - K
            K = K/2
            GOTO 102
  103    J = J + K
  104 CONTINUE
      RETURN
      END



18.12 Prime Factor FFT Algorithm 

Below is the Fortran code for a Prime-Factor Algorithm (PFA) FFT
allowing factors of the length of 2, 3, 4, 5, and 7. It is followed by
an unscrambler.



C---------------------------------------------------------
C
C   A PRIME FACTOR FFT PROGRAM WITH GENERAL MODULES
C   COMPLEX INPUT DATA IN ARRAYS X AND Y
C   COMPLEX OUTPUT IN A AND B
C   LENGTH N WITH M FACTORS IN ARRAY NI
C     N = NI(1)*NI(2)* ... *NI(M)
C   UNSCRAMBLING CONSTANT UNSC
C     UNSC = N/NI(1) + N/NI(2) +...+ N/NI(M), MOD N
C   C. S. BURRUS, RICE UNIVERSITY, JAN 1987
C
C---------------------------------------------------------
C
      SUBROUTINE PFA(X,Y,N,M,NI,A,B,UNSC)
C
      INTEGER NI(4), I(16), UNSC
      REAL X(1), Y(1), A(1), B(1)
C
      DATA C31, C32 / -0.86602540,-1.50000000 /
      DATA C51, C52 /  0.95105652,-1.53884180 /
      DATA C53, C54 / -0.36327126, 0.55901699 /
      DATA C55      / -1.25 /
      DATA C71, C72 / -1.16666667,-0.79015647 /
      DATA C73, C74 /  0.055854267, 0.7343022 /
      DATA C75, C76 /  0.44095855,-0.34087293 /
      DATA C77, C78 /  0.53396936, 0.87484229 /
C
C-----------------NESTED LOOPS----------------------------
C
      DO 10 K=1, M
         N1 = NI(K)
         N2 = N/N1
         DO 15 J=1, N, N1
            IT = J
            DO 30 L=1, N1
               I(L) = IT
               A(L) = X(IT)
               B(L) = Y(IT)
               IT = IT + N2
               IF (IT.GT.N) IT = IT - N
   30       CONTINUE
            GOTO (20,102,103,104,105,20,107), N1
C
C----------------WFTA N=2---------------------------------
C
  102       R1 = A(1)
            A(1) = R1 + A(2)
            A(2) = R1 - A(2)
C
            R1 = B(1)
            B(1) = R1 + B(2)
            B(2) = R1 - B(2)
C
            GOTO 20
C----------------WFTA N=3---------------------------------
C
  103       R2 = (A(2) - A(3)) * C31
            R1 = A(2) + A(3)
            A(1)= A(1) + R1
            R1 = A(1) + R1 * C32
C
            S2 = (B(2) - B(3)) * C31
            S1 = B(2) + B(3)
            B(1)= B(1) + S1
            S1 = B(1) + S1 * C32
C
            A(2) = R1 - S2
            A(3) = R1 + S2
            B(2) = S1 + R2
            B(3) = S1 - R2
C
            GOTO 20
C
C----------------WFTA N=4---------------------------------
C
  104       R1 = A(1) + A(3)
            T1 = A(1) - A(3)
            R2 = A(2) + A(4)
            A(1) = R1 + R2
            A(3) = R1 - R2
C
            R1 = B(1) + B(3)
            T2 = B(1) - B(3)
            R2 = B(2) + B(4)
            B(1) = R1 + R2
            B(3) = R1 - R2
C
            R1 = A(2) - A(4)
            R2 = B(2) - B(4)
C
            A(2) = T1 + R2
            A(4) = T1 - R2
            B(2) = T2 - R1
            B(4) = T2 + R1
C
            GOTO 20
C
C----------------WFTA N=5---------------------------------
C
  105       R1 = A(2) + A(5)
            R4 = A(2) - A(5)
            R3 = A(3) + A(4)
            R2 = A(3) - A(4)
C
            T  = (R1 - R3) * C54
            R1 = R1 + R3
            A(1) = A(1) + R1
            R1 = A(1) + R1 * C55
C
            R3 = R1 - T
            R1 = R1 + T
C
            T  = (R4 + R2) * C51
            R4 = T + R4 * C52
            R2 = T + R2 * C53
C
            S1 = B(2) + B(5)
            S4 = B(2) - B(5)
            S3 = B(3) + B(4)
            S2 = B(3) - B(4)
C
            T  = (S1 - S3) * C54
            S1 = S1 + S3
            B(1) = B(1) + S1
            S1 = B(1) + S1 * C55
C
            S3 = S1 - T
            S1 = S1 + T
C
            T  = (S4 + S2) * C51
            S4 = T + S4 * C52
            S2 = T + S2 * C53
C
            A(2) = R1 + S2
            A(5) = R1 - S2
            A(3) = R3 - S4
            A(4) = R3 + S4
C
            B(2) = S1 - R2
            B(5) = S1 + R2
            B(3) = S3 + R4
            B(4) = S3 - R4
C
            GOTO 20
C----------------WFTA N=7---------------------------------
C
  107       R1 = A(2) + A(7)
            R6 = A(2) - A(7)
            S1 = B(2) + B(7)
            S6 = B(2) - B(7)
            R2 = A(3) + A(6)
            R5 = A(3) - A(6)
            S2 = B(3) + B(6)
            S5 = B(3) - B(6)
            R3 = A(4) + A(5)
            R4 = A(4) - A(5)
            S3 = B(4) + B(5)
            S4 = B(4) - B(5)
C
            T3 = (R1 - R2) * C74
            T  = (R1 - R3) * C72
            R1 = R1 + R2 + R3
            A(1) = A(1) + R1
            R1 = A(1) + R1 * C71
            R2 =(R3 - R2) * C73
            R3 = R1 - T + R2
            R2 = R1 - R2 - T3
            R1 = R1 + T + T3
            T  = (R6 - R5) * C78
            T3 =(R6 + R4) * C76
            R6 =(R6 + R5 - R4) * C75
            R5 =(R5 + R4) * C77
            R4 = R6 - T3 + R5
            R5 = R6 - R5 - T
            R6 = R6 + T3 + T
C
            T3 = (S1 - S2) * C74
            T  = (S1 - S3) * C72
            S1 = S1 + S2 + S3
            B(1) = B(1) + S1
            S1 = B(1) + S1 * C71
            S2 =(S3 - S2) * C73
            S3 = S1 - T + S2
            S2 = S1 - S2 - T3
            S1 = S1 + T + T3
            T  = (S6 - S5) * C78
            T3 = (S6 + S4) * C76
            S6 = (S6 + S5 - S4) * C75
            S5 = (S5 + S4) * C77
            S4 = S6 - T3 + S5
            S5 = S6 - S5 - T
            S6 = S6 + T3 + T
C
            A(2) = R3 + S4
            A(7) = R3 - S4
            A(3) = R1 + S6
            A(6) = R1 - S6
            A(4) = R2 - S5
            A(5) = R2 + S5
            B(4) = S2 + R5
            B(5) = S2 - R5
            B(2) = S3 - R4
            B(7) = S3 + R4
            B(3) = S1 - R6
            B(6) = S1 + R6
C
   20       IT = J
            DO 31 L=1, N1
               I(L) = IT
               X(IT) = A(L)
               Y(IT) = B(L)
               IT = IT + N2
               IF (IT.GT.N) IT = IT - N
   31       CONTINUE
   15    CONTINUE
   10 CONTINUE
C
C-----------------UNSCRAMBLING----------------------------
C
      L = 1
      DO 2 K=1, N
         A(K) = X(L)
         B(K) = Y(L)
         L = L + UNSC
         IF (L.GT.N) L = L - N
    2 CONTINUE
      RETURN
      END
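The index map and the unscrambling constant used by this program can be tried out in a few lines. The Python sketch below follows the same loop structure (blocks that start at multiples of the factor and step by N divided by the factor, modulo N) but, as a simplifying assumption, replaces the WFTA modules with a direct short DFT. The names `short_dft` and `pfa_fft` are hypothetical, indices are 0-based, and the factors must be pairwise relatively prime.

```python
import cmath
import math


def short_dft(v):
    """Direct DFT of a short sequence (stand-in for the WFTA modules)."""
    p = len(v)
    return [sum(v[m] * cmath.exp(-2j * math.pi * m * k / p) for m in range(p))
            for k in range(p)]


def pfa_fft(x, factors):
    """Prime factor FFT; len(x) must equal the product of coprime factors."""
    n = len(x)
    work = list(x)
    for n1 in factors:
        n2 = n // n1
        for j in range(0, n, n1):
            idx = [(j + l * n2) % n for l in range(n1)]
            block = short_dft([work[i] for i in idx])
            for i, value in zip(idx, block):
                work[i] = value         # in-place: results overwrite inputs
    unsc = sum(n // f for f in factors) % n
    # output k was left at location k*UNSC mod N
    return [work[(k * unsc) % n] for k in range(n)]
```

For N = 15 with factors 3 and 5 the constant is UNSC = 5 + 3 = 8. No twiddle factors appear between the stages, which is the defining property of the prime factor algorithm.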
18.13 In Place, In Order Prime Factor FFT Algorithm

Below is the Fortran code for a Prime-Factor Algorithm (PFA) FFT
allowing factors of the length of 2, 3, 4, 5, 7, 8, 9, and 16. It is both
in-place and in-order, so it requires no unscrambler.



C---------------------------------------------------------
C
C   A PRIME FACTOR FFT PROGRAM
C   IN-PLACE AND IN-ORDER
C   COMPLEX INPUT DATA IN ARRAYS X AND Y
C   LENGTH N WITH M FACTORS IN ARRAY NI
C     N = NI(1)*NI(2)*...*NI(M)
C   REDUCED TEMP STORAGE IN SHORT WFTA MODULES
C   Has modules 2,3,4,5,7,8,9,16
C   PROGRAM BY C. S. BURRUS, RICE UNIVERSITY
C                                 SEPT 1983
C---------------------------------------------------------
C
      SUBROUTINE PFA(X,Y,N,M,NI)
      INTEGER NI(4), I(16), IP(16), LP(16)
      REAL X(1), Y(1)
      DATA C31, C32 / -0.86602540,-1.50000000 /
      DATA C51, C52 /  0.95105652,-1.53884180 /
      DATA C53, C54 / -0.36327126, 0.55901699 /
      DATA C55      / -1.25 /
      DATA C71, C72 / -1.16666667,-0.79015647 /
      DATA C73, C74 /  0.055854267, 0.7343022 /
      DATA C75, C76 /  0.44095855,-0.34087293 /
      DATA C77, C78 /  0.53396936, 0.87484229 /
      DATA C81      /  0.70710678 /
      DATA C95      / -0.50000000 /
      DATA C92, C93 /  0.93969262, -0.17364818 /
      DATA C94, C96 /  0.76604444, -0.34202014 /
      DATA C97, C98 / -0.98480775, -0.64278761 /
      DATA C162,C163 /  0.38268343, 1.30656297 /
      DATA C164,C165 /  0.54119610, 0.92387953 /
C
C-----------------NESTED LOOPS----------------------------
C
      DO 10 K=1, M
         N1 = NI(K)
         N2 = N/N1
         L  = 1
         N3 = N2 - N1*(N2/N1)
         DO 15 J = 1, N1
            LP(J) = L
            L = L + N3
            IF (L.GT.N1) L = L - N1
   15    CONTINUE
C
         DO 20 J=1, N, N1
            IT = J
            DO 30 L=1, N1
               I(L) = IT
               IP(LP(L)) = IT
               IT = IT + N2
               IF (IT.GT.N) IT = IT - N
   30       CONTINUE
            GOTO (20,102,103,104,105,20,107,108,109,
     +            20,20,20,20,20,20,116),N1
C----------------WFTA N=2---------------------------------
C
  102       R1 = X(I(1))
            X(I(1)) = R1 + X(I(2))
            X(I(2)) = R1 - X(I(2))
C
            R1 = Y(I(1))
            Y(IP(1)) = R1 + Y(I(2))
            Y(IP(2)) = R1 - Y(I(2))
C
            GOTO 20
C
C----------------WFTA N=3---------------------------------
C
  103       R2 = (X(I(2)) - X(I(3))) * C31
            R1 = X(I(2)) + X(I(3))
            X(I(1))= X(I(1)) + R1
            R1 = X(I(1)) + R1 * C32
C
            S2 = (Y(I(2)) - Y(I(3))) * C31
            S1 = Y(I(2)) + Y(I(3))
            Y(I(1))= Y(I(1)) + S1
            S1 = Y(I(1)) + S1 * C32
C
            X(IP(2)) = R1 - S2
            X(IP(3)) = R1 + S2
            Y(IP(2)) = S1 + R2
            Y(IP(3)) = S1 - R2
C
            GOTO 20
C
C----------------WFTA N=4---------------------------------
C
  104       R1 = X(I(1)) + X(I(3))
            T1 = X(I(1)) - X(I(3))
            R2 = X(I(2)) + X(I(4))
            X(IP(1)) = R1 + R2
            X(IP(3)) = R1 - R2
C
            R1 = Y(I(1)) + Y(I(3))
            T2 = Y(I(1)) - Y(I(3))
            R2 = Y(I(2)) + Y(I(4))
            Y(IP(1)) = R1 + R2
            Y(IP(3)) = R1 - R2
C
            R1 = X(I(2)) - X(I(4))
            R2 = Y(I(2)) - Y(I(4))
C
            X(IP(2)) = T1 + R2
            X(IP(4)) = T1 - R2
            Y(IP(2)) = T2 - R1
            Y(IP(4)) = T2 + R1
C
            GOTO 20



C----------------WFTA N=5---------------------------------
C
  105       R1 = X(I(2)) + X(I(5))
            R4 = X(I(2)) - X(I(5))
            R3 = X(I(3)) + X(I(4))
            R2 = X(I(3)) - X(I(4))
C
            T  = (R1 - R3) * C54
            R1 = R1 + R3
            X(I(1)) = X(I(1)) + R1
            R1 = X(I(1)) + R1 * C55
C
            R3 = R1 - T
            R1 = R1 + T
C
            T  = (R4 + R2) * C51
            R4 = T + R4 * C52
            R2 = T + R2 * C53
C
            S1 = Y(I(2)) + Y(I(5))
            S4 = Y(I(2)) - Y(I(5))
            S3 = Y(I(3)) + Y(I(4))
            S2 = Y(I(3)) - Y(I(4))
C
            T  = (S1 - S3) * C54
            S1 = S1 + S3
            Y(I(1)) = Y(I(1)) + S1
            S1 = Y(I(1)) + S1 * C55
C
            S3 = S1 - T
            S1 = S1 + T
C
            T  = (S4 + S2) * C51
            S4 = T + S4 * C52
            S2 = T + S2 * C53
C
            X(IP(2)) = R1 + S2
            X(IP(5)) = R1 - S2
            X(IP(3)) = R3 - S4
            X(IP(4)) = R3 + S4
C
            Y(IP(2)) = S1 - R2
            Y(IP(5)) = S1 + R2
            Y(IP(3)) = S3 + R4
            Y(IP(4)) = S3 - R4
C
            GOTO 20



C----------------WFTA N=7---------------------------------
C
  107       R1 = X(I(2)) + X(I(7))
            R6 = X(I(2)) - X(I(7))
            S1 = Y(I(2)) + Y(I(7))
            S6 = Y(I(2)) - Y(I(7))
            R2 = X(I(3)) + X(I(6))
            R5 = X(I(3)) - X(I(6))
            S2 = Y(I(3)) + Y(I(6))
            S5 = Y(I(3)) - Y(I(6))
            R3 = X(I(4)) + X(I(5))
            R4 = X(I(4)) - X(I(5))
            S3 = Y(I(4)) + Y(I(5))
            S4 = Y(I(4)) - Y(I(5))
C
            T3 = (R1 - R2) * C74
            T  = (R1 - R3) * C72
            R1 = R1 + R2 + R3
            X(I(1)) = X(I(1)) + R1
            R1 = X(I(1)) + R1 * C71
            R2 =(R3 - R2) * C73
            R3 = R1 - T + R2
            R2 = R1 - R2 - T3
            R1 = R1 + T + T3
            T  = (R6 - R5) * C78
            T3 =(R6 + R4) * C76
            R6 =(R6 + R5 - R4) * C75
            R5 =(R5 + R4) * C77
            R4 = R6 - T3 + R5
            R5 = R6 - R5 - T
            R6 = R6 + T3 + T
C
            T3 = (S1 - S2) * C74
            T  = (S1 - S3) * C72
            S1 = S1 + S2 + S3
            Y(I(1)) = Y(I(1)) + S1
            S1 = Y(I(1)) + S1 * C71
            S2 =(S3 - S2) * C73
            S3 = S1 - T + S2
            S2 = S1 - S2 - T3
            S1 = S1 + T + T3
            T  = (S6 - S5) * C78
            T3 = (S6 + S4) * C76
            S6 = (S6 + S5 - S4) * C75
            S5 = (S5 + S4) * C77
            S4 = S6 - T3 + S5
            S5 = S6 - S5 - T
            S6 = S6 + T3 + T
C
            X(IP(2)) = R3 + S4
            X(IP(7)) = R3 - S4
            X(IP(3)) = R1 + S6
            X(IP(6)) = R1 - S6
            X(IP(4)) = R2 - S5
            X(IP(5)) = R2 + S5
            Y(IP(4)) = S2 + R5
            Y(IP(5)) = S2 - R5
            Y(IP(2)) = S3 - R4
            Y(IP(7)) = S3 + R4
            Y(IP(3)) = S1 - R6
            Y(IP(6)) = S1 + R6
C
            GOTO 20








C----------------WFTA N=8---------------------------------
C
  108       R1 = X(I(1)) + X(I(5))
            R2 = X(I(1)) - X(I(5))
            R3 = X(I(2)) + X(I(8))
            R4 = X(I(2)) - X(I(8))
            R5 = X(I(3)) + X(I(7))
            R6 = X(I(3)) - X(I(7))
            R7 = X(I(4)) + X(I(6))
            R8 = X(I(4)) - X(I(6))
            T1 = R1 + R5
            T2 = R1 - R5
            T3 = R3 + R7
            R3 =(R3 - R7) * C81
            X(IP(1)) = T1 + T3
            X(IP(5)) = T1 - T3
            T1 = R2 + R3
            T3 = R2 - R3
            S1 = R4 - R8
            R4 =(R4 + R8) * C81
            S2 = R4 + R6
            S3 = R4 - R6
            R1 = Y(I(1)) + Y(I(5))
            R2 = Y(I(1)) - Y(I(5))
            R3 = Y(I(2)) + Y(I(8))
            R4 = Y(I(2)) - Y(I(8))
            R5 = Y(I(3)) + Y(I(7))
            R6 = Y(I(3)) - Y(I(7))
            R7 = Y(I(4)) + Y(I(6))
            R8 = Y(I(4)) - Y(I(6))
            T4 = R1 + R5
            R1 = R1 - R5
            R5 = R3 + R7
            R3 =(R3 - R7) * C81
            Y(IP(1)) = T4 + R5
            Y(IP(5)) = T4 - R5
            R5 = R2 + R3
            R2 = R2 - R3
            R3 = R4 - R8
            R4 =(R4 + R8) * C81
            R7 = R4 + R6
            R4 = R4 - R6
            X(IP(2)) = T1 + R7
            X(IP(8)) = T1 - R7
            X(IP(3)) = T2 + R3
            X(IP(7)) = T2 - R3
            X(IP(4)) = T3 + R4
            X(IP(6)) = T3 - R4
            Y(IP(2)) = R5 - S2
            Y(IP(8)) = R5 + S2
            Y(IP(3)) = R1 - S1
            Y(IP(7)) = R1 + S1
            Y(IP(4)) = R2 - S3
            Y(IP(6)) = R2 + S3
C
            GOTO 20



C WFTA N=9 

C 

109 Rl = X(I(2)) + X(I(9)) 



R2 = X(I(2) 
R3 = X(I(3) 
R4 = X(I(3) 
R5 = X(I(4) 
T8 =(X(I(4) 
R7 = X(I(5) 
R8 = X(I(5) 
TO = X(I(1)) + 
T7 = X(I(1); + 
R5 = Rl + R3 + 
X(I(D) = TO + 
T5 = TO + R5 * 
T3 = (R3 - R7) 
R7 = (Rl - R7) 
R3 = (Rl - R3) 



X(I(9)) 

X(I(8)) 

X(I(8)) 

X(I(7)) 

X(I(7))) 

X(I(6)) 

X(I(6)) 

R5 

R5 * C95 

R7 

R5 

C95 

* C92 

* C93 

* C94 



* C31 
T1


= T7 + T3 + 


R3 


T3 


= T7 - T3 - 


R7 


T7 


= T7 + R7 - 


R3 


T6 


= (R2 - R4 - 


t- R8) * C31 


T4 


= (R4 + R8) 


* G96 


R8 


= (R2 - R8) 


* C97 


R2 


= (R2 + R4) 


* C98 


T2 


= T8 + T4 + 


R2 


T4 


= T8 - T4 - 


R8 


T8 
C 

Rl 


= T8 + R8 - 


R2 


= Y(I(2)) + 


Y(I(9)) 


R2 


= Y(I(2)) - 


Y(I(9)) 


R3 


= Y(I(3)) + 


Y(I(8)) 


R4 


= Y(I(3)) - 


Y(I(8)) 


R5 


= Y(I(4)) + 


Y(I(7)) 


R6 


=(Y(I(4)) - 


Y(I(7))) * 


R7 


= Y(I(5)) + 


Y(I(6)) 


R8 


= Y(I(5)) - 


Y(I(6)) 


TO 


= Y(I(D) + 


R5 


T9 


= Y(I(D) + 


R5 * C95 


R5 


= Rl + R3 + 


R7 


Y(I(D) - TO + 


R5 


R5 


= TO + R5 * 


C95 


TO 


= (R3 - R7) 


* C92 


R7 


= (Rl - R7) 


* C93 


R3 


= (Rl - R3) 


* C94 


Rl 


= T9 + TO + 


R3 


TO 


= T9 - TO - 


R7 


R7 


= T9 + R7 - 


R3 


R9 


= (R2 - R4 - 


t- R8) * C31 


R3 


= (R4 + R8) 


* C96 


R8 


= (R2 - R8) 


* C97 


R4 


= (R2 + R4) 


* C98 
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R2 
R3 
R8 



R6 + R3 
R6 - R8 
R6 + R8 



R4 
R3 
R4 



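      X(IP(2)) = T1 ... R2
      X(IP(9)) = T1 ... R2
      Y(IP(2)) = R1 ... T2
      Y(IP(9)) = R1 ... T2
      X(IP(3)) = T3 ... R3
      X(IP(8)) = T3 ... R3
      Y(IP(3)) = T0 ... T4
      Y(IP(8)) = T0 ... T4
      X(IP(4)) = T5 ... R9
      X(IP(7)) = T5 ... R9
      Y(IP(4)) = R5 ... T6
      Y(IP(7)) = R5 ... T6
      X(IP(5)) = T7 ... R8
      X(IP(6)) = T7 ... R8
      Y(IP(5)) = R7 ... T8
      Y(IP(6)) = R7 ... T8
C
      GOTO 20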



C   WFTA N=16
C
  116 R1 = X(I(1)) + X(I(9))
      R2 = X(I(1)) - X(I(9))
      R3 = X(I(2)) + X(I(10))
      R4 = X(I(2)) - X(I(10))
      R5 = X(I(3)) + X(I(11))
      R6 = X(I(3)) - X(I(11))
      R7 = X(I(4)) + X(I(12))
      R8 = X(I(4)) - X(I(12))
      R9 = X(I(5)) + X(I(13))
      R10= X(I(5)) - X(I(13))
      R11 = X(I(6)) + X(I(14))
      R12 = X(I(6)) - X(I(14))
      R13 = X(I(7)) + X(I(15))
      R14 = X(I(7)) - X(I(15))
      R15 = X(I(8)) + X(I(16))
      R16 = X(I(8)) - X(I(16))
      T1 = R1 + R9
      T2 = R1 - R9
      T3 = R3 + R11
      T4 = R3 - R11
      T5 = R5 + R13
      T6 = R5 - R13
      T7 = R7 + R15
      T8 = R7 - R15
      R1 = T1 + T5
      R3 = T1 - T5
      R5 = T3 + T7
      R7 = T3 - T7
      X(IP( 1)) = R1 + R5
      X(IP( 9)) = R1 - R5
      T1 = C81 * (T4 + T8)
      T5 = C81 * (T4 - T8)
      R9 = T2 + T5
      R11= T2 - T5
      R13 = T6 + T1
      R15 = T6 - T1
      T1 = R4 + R16
      T2 = R4 - R16
      T3 = C81 * (R6 + R14)
      T4 = C81 * (R6 - R14)
      T5 = R8 + R12
      T6 = R8 - R12
      T7 = C162 * (T2 - T6)
      T2 = C163 * T2 - T7
      T6 = C164 * T6 - T7
      T7 = R2 + T4
      T8 = R2 - T4
      R2 = T7 + T2
      R4 = T7 - T2
      R6 = T8 + T6
      R8 = T8 - T6
      T7 = C165 * (T1 + T5)
      T2 = T7 - C164 * T1
      T4 = T7 - C163 * T5
      T6 = R10 + T3
      T8 = R10 - T3
      R10 = T6 + T2
      R12 = T6 - T2
      R14 = T8 + T4
      R16 = T8 - T4
      R1 = Y(I(1)) + Y(I(9))
      S2 = Y(I(1)) - Y(I(9))
      S3 = Y(I(2)) + Y(I(10))
      S4 = Y(I(2)) - Y(I(10))
      R5 = Y(I(3)) + Y(I(11))
      S6 = Y(I(3)) - Y(I(11))
      S7 = Y(I(4)) + Y(I(12))
      S8 = Y(I(4)) - Y(I(12))
      S9 = Y(I(5)) + Y(I(13))
      S10= Y(I(5)) - Y(I(13))
      S11 = Y(I(6)) + Y(I(14))
      S12 = Y(I(6)) - Y(I(14))
      S13 = Y(I(7)) + Y(I(15))
      S14 = Y(I(7)) - Y(I(15))
      S15 = Y(I(8)) + Y(I(16))
      S16 = Y(I(8)) - Y(I(16))
      T1 = R1 + S9
      T2 = R1 - S9
      T3 = S3 + S11
      T4 = S3 - S11
      T5 = R5 + S13
      T6 = R5 - S13
      T7 = S7 + S15
      T8 = S7 - S15
      R1 = T1 + T5
      S3 = T1 - T5
      R5 = T3 + T7
      S7 = T3 - T7
      Y(IP( 1)) = R1 + R5
      Y(IP( 9)) = R1 - R5
      X(IP( 5)) = R3 + S7
      X(IP(13)) = R3 - S7
      Y(IP( 5)) = S3 - R7
      Y(IP(13)) = S3 + R7
      T1 = C81 * (T4 + T8)
      T5 = C81 * (T4 - T8)
      S9 = T2 + T5
      S11= T2 - T5
      S13 = T6 + T1
      S15 = T6 - T1
      T1 = S4 + S16
      T2 = S4 - S16
      T3 = C81 * (S6 + S14)
      T4 = C81 * (S6 - S14)
      T5 = S8 + S12
      T6 = S8 - S12
      T7 = C162 * (T2 - T6)
      T2 = C163 * T2 - T7
      T6 = C164 * T6 - T7
      T7 = S2 + T4
      T8 = S2 - T4
      S2 = T7 + T2
      S4 = T7 - T2
      S6 = T8 + T6
      S8 = T8 - T6
      T7 = C165 * (T1 + T5)
      T2 = T7 - C164 * T1
      T4 = T7 - C163 * T5
      T6 = S10 + T3
      T8 = S10 - T3
      S10 = T6 + T2
      S12 = T6 - T2
      S14 = T8 + T4
      S16 = T8 - T4
      X(IP( 2)) = R2 + S10
      X(IP(16)) = R2 - S10
      Y(IP( 2)) = S2 - R10
      Y(IP(16)) = S2 + R10
      X(IP( 3)) = R9 + S13
      X(IP(15)) = R9 - S13
      Y(IP( 3)) = S9 - R13
      Y(IP(15)) = S9 + R13
      X(IP( 4)) = R8 - S16
      X(IP(14)) = R8 + S16
      Y(IP( 4)) = S8 + R16
      Y(IP(14)) = S8 - R16
      X(IP( 6)) = R6 + S14
      X(IP(12)) = R6 - S14
      Y(IP( 6)) = S6 - R14
      Y(IP(12)) = S6 + R14
      X(IP( 7)) = R11 - S15
      X(IP(11)) = R11 + S15
      Y(IP( 7)) = S11 + R15
      Y(IP(11)) = S11 - R15
      X(IP( 8)) = R4 - S12
      X(IP(10)) = R4 + S12
      Y(IP( 8)) = S4 + R12
      Y(IP(10)) = S4 - R12
C
      GOTO 20
C
   20 CONTINUE
   10 CONTINUE
      RETURN
      END

Appendix 4: Programs for Short FFTs



This appendix will discuss efficient short FFT programs that can 
be used in both the Cooley-Tukey (Chapter 9) and the Prime Fac- 
tor FFT algorithms (Chapter 10). Links and references are given to 
Fortran listings that can be used "as is" or put into the indexed loops 
of existing programs to give greater efficiency and/or a greater va- 
riety of allowed lengths. Special programs have been written for 
lengths: N = 2, 3, 4, 5, 7, 8, 9, 11, 13, 16, 17, 19, 25, etc. 

In the early days of the FFT, multiplication was done in software 
and was, therefore, much slower than an addition. With modern
hardware, a floating point multiplication can be done in one clock 
cycle of the computer, microprocessor, or DSP chip, requiring the 
same time as an addition. Indeed, in some computers and many 
DSP chips, both a multiplication and an addition (or accumulation) 
can be done in one cycle while the indexing and memory access is 
done in parallel. Most of the algorithms described here are not 
hardware architecture specific but are designed to minimize both 
multiplications and additions. 

The most basic and most often used short FFT (or DFT) is the
length N = 2. In the Cooley-Tukey FFT, it is called a "butterfly",
and its reason for fame is that it requires no multiplications at
all, only one complex addition and one complex subtraction, and
needs only one complex temporary storage location. This is
illustrated in Figure 1: The Prime Factor and Winograd Transform
Algorithms (Figure 10.1) and code is shown in Figure 2: The Prime
Factor and Winograd Transform Algorithms (Figure 10.2). The second
most used length is N = 4 because it is the only other short length
requiring no multiplications and a minimum of additions. All other
short FFTs require some multiplication, but for powers of two,
N = 8 and N = 16 require few enough to be worth special coding in
some situations.

1 This content is available online at <http://cnx.org/content/m17646/1.4/>.
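
The two multiplication-free lengths can be sketched as follows. This is a minimal illustration in free-form Fortran, not the listing referred to in Figure 10.2; the names short_dft, dft2, and dft4 are assumptions made for this example, and the forward transform with W = exp(-j2π/N) is assumed.

```fortran
! Sketch: length-2 and length-4 DFT modules that need no multiplications.
! The names short_dft, dft2 and dft4 are illustrative only.
module short_dft
  implicit none
contains
  ! Length-2 DFT ("butterfly"): one complex addition, one complex
  ! subtraction, and a single complex temporary.
  subroutine dft2(x0, x1)
    complex, intent(inout) :: x0, x1
    complex :: t
    t  = x0 - x1
    x0 = x0 + x1
    x1 = t
  end subroutine dft2

  ! Length-4 DFT with W = exp(-j*2*pi/4) = -j. Multiplying by -j only
  ! swaps the real and imaginary parts and changes one sign, so no real
  ! multiplications are required.
  subroutine dft4(x)
    complex, intent(inout) :: x(0:3)
    complex :: a, b, c, d
    a = x(0) + x(2)
    b = x(0) - x(2)
    c = x(1) + x(3)
    d = x(1) - x(3)
    d = cmplx(aimag(d), -real(d))   ! d = -j * (x(1) - x(3))
    x(0) = a + c
    x(1) = b + d
    x(2) = a - c
    x(3) = b - d
  end subroutine dft4
end module short_dft

program demo
  use short_dft
  implicit none
  complex :: u, v, x(0:3)
  u = cmplx(1.0, 2.0)
  v = cmplx(3.0, -1.0)
  call dft2(u, v)
  print *, u, v
  x = [cmplx(1.0, 0.0), cmplx(2.0, 0.0), cmplx(3.0, 0.0), cmplx(4.0, 0.0)]
  call dft4(x)
  print *, x
end program demo
```

For the input 1, 2, 3, 4 the length-4 module yields 10, -2 + 2j, -2, -2 - 2j, which agrees with the directly evaluated DFT sum; the length-2 call on 1 + 2j and 3 - j yields 4 + j and -2 + 3j.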

Code for other short lengths such as the primes N =
3, 5, 7, 11, 13, 17, and 19 and the composites N = 9 and 25
is included in the programs for the prime factor algorithm or the
WFTA. These modules are derived using the theory in Chapters 5, 6,
and 9. They can also be found in references ... and

If these short FFTs are used as modules in the basic prime factor
algorithm (PFA), then the straightforward development used for
the modules in Figure 17.12 is used. However, the more
complicated indexing used to achieve in-order, in-place
calculation in {xxxxx} requires different code.
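
To make the idea of a short module with indexed access concrete, here is a hedged sketch of a length-3 DFT that works in place on separate real and imaginary arrays through an index array, in the spirit of the X(I(k)), Y(I(k)) notation of the listings. It is not the author's module: the names short_dft3, dft3, c3, and s3 are assumptions for this example, the arithmetic is a plain arrangement rather than the Winograd form, and the forward transform with W = exp(-j2π/3) is assumed. The in-order, in-place variant mentioned above is not shown.

```fortran
! Sketch: length-3 DFT module applied in place through an index array.
! All names here are illustrative; this is not taken from the book's code.
module short_dft3
  implicit none
  real, parameter :: c3 = -0.5        ! cos(2*pi/3)
  real, parameter :: s3 = 0.8660254   ! sin(2*pi/3)
contains
  subroutine dft3(x, y, i)
    real, intent(inout) :: x(:), y(:)   ! real and imaginary parts
    integer, intent(in) :: i(3)         ! positions of the three points
    real :: r1, r2, s1, s2, tr, ti
    r1 = x(i(2)) + x(i(3))
    r2 = (x(i(2)) - x(i(3))) * s3
    s1 = y(i(2)) + y(i(3))
    s2 = (y(i(2)) - y(i(3))) * s3
    tr = x(i(1)) + c3 * r1
    ti = y(i(1)) + c3 * s1
    x(i(1)) = x(i(1)) + r1              ! output index 0
    y(i(1)) = y(i(1)) + s1
    x(i(2)) = tr + s2                   ! output index 1
    y(i(2)) = ti - r2
    x(i(3)) = tr - s2                   ! output index 2
    y(i(3)) = ti + r2
  end subroutine dft3
end module short_dft3

program demo3
  use short_dft3
  implicit none
  real :: x(3), y(3)
  integer :: idx(3)
  x = [1.0, 2.0, 3.0]
  y = 0.0
  idx = [1, 2, 3]
  call dft3(x, y, idx)
  print *, x
  print *, y
end program demo3
```

For the real input 1, 2, 3 the result is 6, -1.5 + 0.866j, -1.5 - 0.866j. In a prime factor program the index array would be recomputed inside the loops so that the same module is applied to each group of three points; how those indices are generated is outside this sketch.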

For each of the indicated lengths, the computer code is given in a 
Connexions module. 

They are not in the collection Fast Fourier Transforms 2 as the 
printed version would be too long. However, one can link to them 
on-line from the following buttons: 



N=2 3 



N=3 4 
N=4 5



2 Fast Fourier Transforms <http://cnx.org/content/col10550/latest/>
3 "N=2" <http://cnx.org/content/m17625/latest/>
4 "N=3" <http://cnx.org/content/m17626/latest/>
5 "N=4" <http://cnx.org/content/m17627/latest/>
N=5 6

N=7 7

N=8

N=9

N=11

N=13

N=16

N=17

N=19

N=25

Versions for the in-place, in-order prime factor algorithm {pfa} can
be obtained from:

N=2 8

N=3 9

N=4 10

N=5 11

N=7 12

N=8 13

N=9 14

N=11 15

N=13 16

N=16 17



6 "N=5" <http://cnx.org/content/m17628/latest/>
7 "N=7" <http://cnx.org/content/m17629/latest/>
8 "pN=2" <http://cnx.org/content/m17631/latest/>
9 "pN=3" <http://cnx.org/content/m17632/latest/>
10 "pN=4" <http://cnx.org/content/m17633/latest/>
11 "pN=5" <http://cnx.org/content/m17634/latest/>
12 "pN=7" <http://cnx.org/content/m17635/latest/>
13 "pN=8" <http://cnx.org/content/m17636/latest/>
14 "pN=9" <http://cnx.org/content/m17637/latest/>
15 "N = 11 Winograd FFT module" <http://cnx.org/content/m17377/latest/>
16 "N = 13 Winograd FFT module" <http://cnx.org/content/m17378/latest/>
17 "N = 16 FFT module" <http://cnx.org/content/m17382/latest/>

N=17 18

N=19 19

N=25 20



A technical report that describes the length 11, 13, 17, and 19
modules is in {report 8105}, and another technical report that
describes a program that will automatically generate a prime
length FFT and its flow graph is in {report xxx}.



18 "N = 17 Winograd FFT module" <http://cnx.org/content/m17380/latest/>
19 "N = 19 Winograd FFT module" <http://cnx.org/content/m17381/latest/>
20 "N = 25 FFT module" <http://cnx.org/content/m17383/latest/>